Abstract

The problem of model selection arises in a number of contexts, such as subset selection in linear regression, 
estimation of structures in graphical models, and signal denoising. This paper studies non-asymptotic model selection 
for the general case of arbitrary (random or deterministic) design matrices and arbitrary nonzero entries of the signal. 
In this regard, it generalizes the notion of incoherence in the existing literature on model selection and introduces 
two fundamental measures of coherence — termed as the worst-case coherence and the average coherence — among 
the columns of a design matrix. It utilizes these two measures of coherence to provide an in-depth analysis of a 
simple, model-order agnostic one-step thresholding (OST) algorithm for model selection and proves that OST is 
feasible for exact as well as partial model selection as long as the design matrix obeys an easily verifiable property, 
which is termed as the coherence property. One of the key insights offered by the ensuing analysis in this regard is 
that OST can successfully carry out model selection even when methods based on convex optimization such as the
lasso fail due to the rank deficiency of the submatrices of the design matrix. In addition, the paper establishes that if 
the design matrix has reasonably small worst-case and average coherence then OST performs near-optimally when 
either (i) the energy of any nonzero entry of the signal is close to the average signal energy per nonzero entry or 
(ii) the signal-to-noise ratio in the measurement system is not too high. Finally, two other key contributions of the 
paper are that (i) it provides bounds on the average coherence of Gaussian matrices and Gabor frames, and (ii) it 
extends the results on model selection using OST to low-complexity, model-order agnostic recovery of sparse signals 
with arbitrary nonzero entries. In particular, this part of the analysis in the paper implies that an Alltop Gabor frame 
together with OST can successfully carry out model selection and recovery of sparse signals irrespective of the phases 
of the nonzero entries even if the number of nonzero entries scales almost linearly with the number of rows of the 
Alltop Gabor frame. 
I. Introduction

A. Background 

In many information processing and statistics problems involving high-dimensional data, the "curse of
dimensionality" can often be broken by exploiting the fact that real-world data tend to live in low-dimensional
manifolds. This phenomenon is exemplified by the important special case in which a data vector $\beta \in \mathbb{C}^p$
satisfies $\|\beta\|_0 := \sum_{i=1}^{p} 1_{\{|\beta_i| > 0\}} \le k \ll p$ and is observed according to the linear
measurement model $y = X\beta + \eta$. Here, $X$ is an $n \times p$ (real- or complex-valued) matrix called the
measurement or design matrix, while $\eta \in \mathbb{C}^n$ represents noise in the measurement system. In this problem,
the assumption that the data vector $\beta$ is "$k$-sparse" allows one to operate in the so-called "compressed"
setting, $k < n \ll p$, thereby enabling tasks that might be deemed prohibitive otherwise because of either
technological or computational constraints.

Fundamentally, given a measurement vector $y = X\beta + \eta$ in the compressed setting, there are three
complementary, but nonetheless distinct, questions that one needs to answer:

[Estimation] Under what conditions can we obtain a reliable estimate of a $k$-sparse $\beta$ from $y$?
[Regression] Under what conditions can we reliably approximate $X\beta$ corresponding to a $k$-sparse $\beta$ from $y$?
[Model Selection] Under what conditions can we reliably recover the locations of the nonzero entries of a
$k$-sparse $\beta$ (in other words, the model $S := \{i \in \{1, \dots, p\} : |\beta_i| > 0\}$) from $y$?
A number of researchers have attempted to address the estimation and the regression question over the past several 
years. In many application areas, however, the model-selection question is equally — if not more — important than the 
other two questions. In particular, the problem of model selection (sometimes also known as variable selection or 
sparsity pattern recovery) arises indirectly in a number of contexts, such as subset selection in linear regression [1],
estimation of structures in graphical models [2], and signal denoising [3]. In addition, solving the model-selection
problem in some (but not all) cases also enables one to solve the estimation and/or the regression problem. 

B. Main Contributions 

Model Selection: One of the primary objectives of this paper is to study the problem of polynomial time,
model-order agnostic model selection in a compressed setting for the general case of arbitrary (random or
deterministic) design matrices and arbitrary nonzero entries of the signal. In order to accomplish this task, we
introduce in the paper two fundamental measures of coherence among the (normalized) columns
$\{x_i \in \mathbb{C}^n\}$ of the $n \times p$ design matrix $X$, namely:¹

• Worst-Case Coherence: $\mu(X) := \max_{i,j : i \neq j} |\langle x_i, x_j \rangle|$, and

• Average Coherence: $\nu(X) := \frac{1}{p-1} \max_{i} \Big| \sum_{j : j \neq i} \langle x_i, x_j \rangle \Big|$.



¹ Here, and throughout the rest of this paper, we assume without loss of generality that $X$ has unit $\ell_2$-norm columns. This is because deviations
from this assumption can always be accounted for by appropriately scaling the entries of the data vector $\beta$ instead.
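Algorithm 1 The One-Step Thresholding (OST) Algorithm for Model Selection

Input: An $n \times p$ matrix $X$, a vector $y \in \mathbb{C}^n$, and a threshold $\lambda > 0$
Output: An estimate $\hat{S} \subset \{1, \dots, p\}$ of the true model $S$
$f \leftarrow X^H y$   {Form signal proxy}
$\hat{S} \leftarrow \{i \in \{1, \dots, p\} : |f_i| > \lambda\}$   {Select model via OST}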



Roughly speaking, worst-case coherence (which seems to have been introduced in the related literature in [4]) is
a similarity measure between the columns of a design matrix: the smaller the worst-case coherence, the less similar
the columns. On the other hand, average coherence (which was first introduced in a conference version of this
paper [5]) is a measure of the spread of the columns of a design matrix within the $n$-dimensional unit ball: the
smaller the average coherence, the more spread out the column vectors.
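
As a rough illustration of the two definitions, both measures can be read off the Gram matrix of the normalized columns. The sketch below assumes NumPy; the function name `coherences` and the toy matrices are hypothetical and are not taken from the paper.

```python
import numpy as np

def coherences(X):
    """Return (worst-case coherence mu, average coherence nu) of X.

    Columns are rescaled to unit l2-norm first, matching the convention
    used in the text; X needs at least two nonzero columns.
    """
    X = np.asarray(X, dtype=complex)
    X = X / np.linalg.norm(X, axis=0)            # unit-norm columns
    p = X.shape[1]
    G = X.conj().T @ X - np.eye(p)               # entries <x_i, x_j> for i != j, zero diagonal
    mu = np.abs(G).max()                         # max_{i != j} |<x_i, x_j>|
    nu = np.abs(G.sum(axis=1)).max() / (p - 1)   # max_i |sum_{j != i} <x_i, x_j>| / (p - 1)
    return mu, nu

# Orthonormal columns: both coherences vanish.
mu_eye, nu_eye = coherences(np.eye(4))

# Two identical columns and one orthogonal column: mu = 1 and nu = 1/2.
X_dup = np.array([[1.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
mu_dup, nu_dup = coherences(X_dup)
```

The second example shows how the two measures differ: a single pair of identical columns drives the worst-case coherence to its maximum, while the average coherence is diluted by the orthogonal column.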

Our main contribution in the area of model selection is that we make use of these two measures of coherence
to propose and analyze a model-order agnostic threshold for the one-step thresholding (OST) algorithm (see
Algorithm 1) for model selection. Specifically, we characterize in Section II both the exact and the partial
model-selection performance of OST in a non-asymptotic setting in terms of $\mu$ and $\nu$. In particular, we establish
in Section II that if $\mu(X) \asymp n^{-1/2}$ and $\nu(X) \lesssim n^{-1}$ then OST, despite being computationally
primitive, can perform near-optimally for the case when either (i) the energy of any nonzero entry of $\beta$ is not too
far away from the average signal energy per nonzero entry $\|\beta\|_2^2/k$ or (ii) the signal-to-noise ratio (SNR) in
the measurement system is not too high.² Equally importantly, in contrast to some of the existing literature on model
selection, this analysis in the paper holds for arbitrary values of the nonzero entries of $\beta$ and it does not
require the $n \times k$ submatrices of the design matrix $X$ to have full column rank.
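
The two steps of OST (form the proxy $X^H y$, then threshold it) can be sketched as follows. This is a minimal illustration assuming NumPy; the name `ost_select`, the orthonormal toy design, and the threshold value are hypothetical, and the threshold analyzed in Section II is not used here.

```python
import numpy as np

def ost_select(X, y, lam):
    """One-step thresholding: indices i with |<x_i, y>| > lam."""
    f = X.conj().T @ y                                  # signal proxy f = X^H y
    return set(np.flatnonzero(np.abs(f) > lam).tolist())

# Noiseless toy example with an orthonormal design (hypothetical values).
X = np.eye(5, dtype=complex)
beta = np.array([0, 2.0, 0, -1.5j, 0])
y = X @ beta
S_hat = ost_select(X, y, lam=1.0)
```

The cost is dominated by the single matrix-vector product, which is what makes the method attractive for large $p$.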

Sparse-Signal Recovery: The second main objective of this paper is to study the problem of low-complexity,
model-order agnostic recovery of $k$-sparse signals with arbitrary nonzero entries in the noiseless case. In this regard,
our main contribution in the area of sparse-signal recovery is that we make use of a recent result by Tropp [6] in
Section III to extend our results on model selection to recovery of $k$-sparse signals using OST (see Algorithm 2). In
particular, we establish in Section IV that Gabor frames (which are collections of time- and frequency-shifts of a
nonzero seed vector (sequence) in $\mathbb{C}^n$) can potentially be used together with OST to exactly recover most
$k$-sparse signals with arbitrary nonzero entries as long as $k \lesssim \mu^{-2}/\log n$ and the energy of any nonzero
entry of $\beta$ is not too far away from $\|\beta\|_2^2/k$. This result then applies immediately to Gabor frames
generated from the Alltop sequence [7]. Specifically, since Gabor frames generated from the Alltop sequence have
worst-case coherence $\mu = 1/\sqrt{n}$ for any prime $n \ge 5$ [8], this result implies that an Alltop Gabor frame
together with OST successfully recovers most $k$-sparse signals irrespective of the values of the nonzero entries of
$\beta$ as long as $k \lesssim n/\log n$ and the energy of any nonzero entry of $\beta$ is not too far away from
$\|\beta\|_2^2/k$.
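
To make the Gabor-frame construction concrete, the following sketch builds the $n \times n^2$ frame of all time and frequency shifts of the cubic-phase Alltop seed $f_j = n^{-1/2} e^{2\pi i j^3/n}$ and measures its worst-case coherence numerically. It assumes NumPy; the function name `alltop_gabor_frame` and the choice $n = 7$ are illustrative only.

```python
import numpy as np

def alltop_gabor_frame(n):
    """n x n^2 Gabor frame generated by the Alltop sequence (n prime, n >= 5)."""
    j = np.arange(n)
    seed = np.exp(2j * np.pi * ((j ** 3) % n) / n) / np.sqrt(n)   # unit-norm Alltop seed
    cols = []
    for shift in range(n):                    # cyclic time shifts
        shifted = np.roll(seed, shift)
        for freq in range(n):                 # frequency shifts (modulations)
            cols.append(np.exp(2j * np.pi * freq * j / n) * shifted)
    return np.stack(cols, axis=1)

n = 7
X = alltop_gabor_frame(n)
G = X.conj().T @ X - np.eye(n * n)            # off-diagonal Gram entries
mu = np.abs(G).max()                          # worst-case coherence, expected 1/sqrt(n)
```

For a prime $n \ge 5$ the distinct columns have inner products of magnitude either $0$ or $1/\sqrt{n}$, so the computed `mu` should agree with $1/\sqrt{n}$ up to rounding.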

² Recall "Big-O" notation: $f(n) = O(g(n))$ (alternatively, $f(n) \lesssim g(n)$) if $\exists\, c_o > 0, n_o : \forall\, n \ge n_o$, $f(n) \le c_o g(n)$; $f(n) = \Omega(g(n))$
(alternatively, $f(n) \gtrsim g(n)$) if $g(n) = O(f(n))$; and $f(n) = \Theta(g(n))$ (alternatively, $f(n) \asymp g(n)$) if $g(n) \lesssim f(n) \lesssim g(n)$.
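Algorithm 2 The One-Step Thresholding (OST) Algorithm for Sparse-Signal Recovery

Input: An $n \times p$ matrix $X$, a vector $y \in \mathbb{C}^n$, and a threshold $\lambda > 0$
Output: An estimate $\hat{\beta} \in \mathbb{C}^p$ of the true sparse signal $\beta$
$\hat{\beta} \leftarrow 0$   {Initialize}
$f \leftarrow X^H y$   {Form signal proxy}
$\hat{\mathcal{I}} \leftarrow \{i \in \{1, \dots, p\} : |f_i| > \lambda\}$   {Select indices via OST}
$\hat{\beta}_{\hat{\mathcal{I}}} \leftarrow (X_{\hat{\mathcal{I}}})^{\dagger} y$   {Recover signal via least-squares}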
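
The recovery variant adds a least-squares fit on the selected columns. The sketch below assumes NumPy; the name `ost_recover`, the identity-plus-Hadamard design, and the threshold value are hypothetical choices made so that the toy example is deterministic.

```python
import numpy as np

def ost_recover(X, y, lam):
    """Threshold X^H y at lam, then least-squares fit on the selected columns."""
    p = X.shape[1]
    beta_hat = np.zeros(p, dtype=complex)             # initialize
    f = X.conj().T @ y                                # signal proxy
    idx = np.flatnonzero(np.abs(f) > lam)             # indices selected via OST
    if idx.size:
        sol, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        beta_hat[idx] = sol                           # least-squares on the estimated support
    return beta_hat

# Deterministic toy design: identity next to a scaled 16 x 16 Hadamard matrix,
# which has unit-norm columns and worst-case coherence 1/4.
H = np.array([[1.0]])
for _ in range(4):
    H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
X = np.hstack([np.eye(16), H / 4.0]).astype(complex)

beta = np.zeros(32, dtype=complex)
beta[[2, 9]] = [4.0, -4.0j]
y = X @ beta                                          # noiseless measurements
beta_hat = ost_recover(X, y, lam=3.0)
```

In this example the proxy has magnitude 4 on the support and at most $\sqrt{2}$ elsewhere, so any threshold between the two separates the support and the least-squares step returns the signal exactly.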



C. Relationship to Previous Work 

The problems of model selection and sparse-signal recovery in general and the use of OST (also known as
simple thresholding [9] and marginal regression [10]) to solve these problems in particular have a rich history
in the literature. In the context of model selection in the compressed setting, Mallows' $C_p$ selection procedure
[11] and the Akaike information criterion (AIC) [12], both of which essentially attempt to solve a
complexity-regularized version of the least-squares criterion, are considered to be seminal works, and are known to
perform well empirically as well as theoretically; see, e.g., [13] and the references therein. These two procedures have
been modified by numerous researchers over the years in order to improve their performance, the most notable
variants being the Bayesian information criterion (BIC) [14] and the risk inflation criterion (RIC) [15]. Solving
model-selection procedures such as $C_p$, AIC, BIC, and RIC, however, is known to be an NP-hard problem [16]
even if the true model order $k$ is made available to these procedures.

In order to overcome the computational intractability of these model-selection procedures, several methods based
on convex optimization have been proposed by various researchers in recent years. Among these proposed methods,
the lasso [17] has arguably become the standard tool for model selection, which can be partly attributed to the
theoretical guarantees provided for the lasso in [2], [18]-[20]. In particular, the results reported in [2], [18] establish
that the lasso asymptotically identifies the correct model under certain conditions on the design matrix $X$ and the
sparse vector $\beta$. Later, Wainwright in [19] strengthens the results of [2], [18] and makes explicit the dependence of
exact model selection using the lasso on the smallest (in magnitude) nonzero entry of $\beta$. However, apart from the
fact that the results reported in [2], [18], [19] are for exact model selection and are only asymptotic in nature, the
main limitation of these works is that explicit verification of the conditions (such as the irrepresentable condition
of [18] and the incoherence condition of [19]) that a generic design matrix $X$ needs to satisfy is computationally
intractable for $k \gtrsim \mu^{-1}$. The most general (and non-asymptotic) model-selection results using the lasso for
arbitrary design matrices have been reported in [20]. Specifically, Candès and Plan have established in [20] that the
lasso correctly identifies most models with probability $1 - O(p^{-1})$ under certain conditions on the smallest nonzero
entry of $\beta$ provided: (i) the spectral norm (the largest singular value) and the worst-case coherence of $X$ are not
too large, and (ii) the values of the nonzero entries of $\beta$ are independent and statistically symmetric around zero.
Despite these recent theoretical triumphs of the lasso, it is still desirable to study alternative solutions to the
problem of polynomial time, model-order agnostic model selection in a compressed setting. This is because:³

1) Lasso requires the minimum singular values of the submatrices of $X$ corresponding to the true models to be
bounded away from zero [2], [18]-[20]. While this is a plausible condition for the case when one is interested
in estimating $\beta$, it is arguable whether this condition is necessary for the case of model selection.

2) The current literature on model selection using the lasso lacks guarantees beyond $k \gtrsim \mu^{-1}$ for the case of
generic design matrices and arbitrary nonzero entries. In particular, given an arbitrary design matrix $X$,
[18]-[20] do not provide any guarantees beyond $k \gtrsim \sqrt{n}$ for even the simple case of $\beta \in \mathbb{R}^p_+$.

3) The computational complexity of the lasso for generic design matrices tends to be $O(p^3 + np^2)$ [10]. This
makes the lasso computationally demanding for large-scale model-selection problems.

Recently, a few researchers have raised somewhat similar concerns about the lasso and revisited the much older
(and oft-forgotten) method of thresholding for model selection [10], [22]-[24], which has computational complexity
of $O(np)$ only and which is known to be nearly optimal for $p \times p$ orthonormal design matrices [25]. Algorithmically,
this makes our approach to model selection similar to that of [10], [22]-[24]. Nevertheless, the OST algorithm
presented in this paper differs from [10], [22]-[24] in five key aspects:

1) Model-Order Agnostic Model Selection: Unlike [10], [22]-[24], the OST algorithm presented in this paper is
completely agnostic to both the true model order k and any estimate of k. 

2) Generic Design Matrices and Arbitrary Nonzero Entries: The results reported in this paper hold for arbitrary 
(random or deterministic) design matrices and do not assume any statistical prior on the values of the nonzero 
entries of $\beta$ even when $k$ scales linearly with $n$. In contrast, [23] only studies the problem of Gaussian design
matrices whereas the most influential results reported in [10], [22], [24] assume that the values of the nonzero
entries of $\beta$ are independent and statistically symmetric around zero.

3) Verifiable Sufficient Conditions: In contrast to [10], [22]-[24], we relate the model-selection performance of
OST to two global parameters of $X$, namely, $\mu$ and $\nu$, which are trivially computable in polynomial time:
$\mu(X) \equiv \|X^H X - I\|_{\max}$ and $\nu(X) \equiv \frac{1}{p-1} \|(X^H X - I)\mathbf{1}\|_{\infty}$.

4) Non-Asymptotic Theory: Similar to [10], [23], [24], the analysis in this paper can be used to establish that OST
achieves (asymptotically) consistent model selection under certain conditions. However, the results reported 
in this paper are completely non-asymptotic in nature (with explicit constants) and thereby shed light on the 
rate at which OST achieves consistent model selection. 

5) Partial Model Selection: In addition to the exact model-selection performance of OST, we also characterize
in the paper its partial model-selection performance. In this regard, we establish that the universal threshold
proposed in Section II for OST guarantees $\hat{S} \subset S$ with high probability and we quantify the cardinality of
the estimate $\hat{S}$. On the other hand, both [22] and [23] study only exact model selection, whereas [10], [24]
study approximate (though not partial) model selection only for Gaussian design matrices [10] and assuming
Gaussian (resp. statistical) priors on the nonzero entries of $\beta$ [24] (resp. [10]).

³ During the course of revising this paper we also became aware of [21], which proposes a thresholded variant of basis pursuit [3] for sparsity
pattern recovery using Gaussian design matrices. However, the results reported in [21] are limited because of similar issues and because of the
requirement that the magnitude of the smallest nonzero entry of $\beta$ be known to the algorithm.

We conclude this discussion of model selection by making three important remarks. First, to the best of our
knowledge, Donoho in [9, Theorem 7.2] reported some of the earliest known results for thresholding in the
compressed setting. Nevertheless, the conclusion drawn in [9] was that thresholding is feasible for model selection
as long as $k \lesssim \mu^{-1}$, the so-called "square root bottleneck." Second, the structure of OST and the model-order
agnostic threshold of this paper enable us to carry out localized model selection. Specifically, if one is provided at
the time of recovery with a set $T$ such that $T \supset S$ then the threshold proposed in this paper enables one to carry
out model selection using the submatrix $X_T$ instead of $X$, thereby reducing the complexity of OST from $O(np)$
to $O(n|T|)$. Third, the results reported in this paper hold for any $n \le p$ and, in particular, the universal threshold
proposed here for model selection reduces to the universal threshold proposed by Donoho and Johnstone [25] for
$p \times p$ orthonormal design matrices. In this sense, some of the results reported in [25] can also be thought of as
special instances of the results reported in this paper.
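
The localized variant mentioned in the second remark only needs the columns indexed by the candidate set $T$. A minimal sketch, assuming NumPy; the name `ost_select_local` and the toy data are hypothetical.

```python
import numpy as np

def ost_select_local(X, y, lam, T):
    """OST restricted to a candidate index set T (assumed to contain the true model S)."""
    T = np.asarray(sorted(T))
    f_T = X[:, T].conj().T @ y               # only |T| inner products, i.e. O(n|T|) work
    return set(T[np.abs(f_T) > lam].tolist())

X = np.eye(6, dtype=complex)
beta = np.array([0, 3.0, 0, 0, 2.0j, 0])
y = X @ beta
S_hat = ost_select_local(X, y, lam=1.0, T={1, 2, 4})
```

Because the threshold does not depend on the model order, the same value of `lam` can be reused for any candidate set that contains the true model.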

Finally, in the context of sparse-signal recovery in the compressed setting, there exists now a large body of
literature that studies this problem under the rubric of compressed sensing [26]. However, convex optimization
procedures such as basis pursuit (BP) [3], Dantzig selector [27], and lasso, although known for their ability
to recover sparse signals under a variety of conditions, are ill-suited for large-scale problems because of their
computational complexity. On the other hand, low-complexity iterative algorithms such as matching pursuit [28],
subspace pursuit [29], CoSaMP [30], and iterative hard thresholding [31], and combinatorial algorithms based on
group testing such as HHS pursuit [32] and Fourier samplers [33], [34] have been shown to perform well either
only for some special classes of design matrices [32]-[34] or for design matrices that satisfy the restricted isometry
property (RIP) [35]. Nevertheless, explicitly verifying that $X$ satisfies the RIP of order $k \gtrsim \mu^{-1}$ is computationally
intractable; in particular, since we have from the Welch bound [36] that $\mu^{-1} \lesssim \sqrt{n}$ for $p \gg 1$, the guarantees
provided in [29]-[31] for the case of generic design matrices at best hold only for $k$-sparse signals with $k \lesssim \sqrt{n}$.
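
For reference, the step from the Welch bound to $\mu^{-1} \lesssim \sqrt{n}$ can be spelled out as follows. The bound is stated here in its standard form for an $n \times p$ matrix with unit-norm columns; the condition $p \ge 2n$ is an illustrative choice used only to make the constant explicit.

```latex
% Welch bound for an n x p matrix X with unit-norm columns and p > n:
\mu(X) \;\ge\; \sqrt{\frac{p-n}{n(p-1)}}
\quad\Longrightarrow\quad
\mu^{-1}(X) \;\le\; \sqrt{n}\,\sqrt{\frac{p-1}{p-n}} .
% If p >= 2n, then p - n >= p/2 and (p-1)/(p-n) < 2, so that
\mu^{-1}(X) \;<\; \sqrt{2n} .
```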

In contrast, and motivated by the need to have verifiable sufficient conditions for low-complexity algorithms and
arbitrary values of the nonzero entries of $\beta$ even when $k \gtrsim \sqrt{n}$, we extend in Section III our results on model
selection using OST and characterize the performance of Algorithm 2 in terms of three global parameters of the
design matrix $X$: $\mu(X)$, $\nu(X)$, and $\|X\|_2$. In particular, a highlight of this part of the paper is that we partially
strengthen the results of Pfander et al. [37] and Herman and Strohmer [38] by establishing that Gabor frames
generated from the Alltop sequence can be used along with OST to recover most $k$-sparse signals belonging to
certain classes even when $k \gtrsim \sqrt{n}$. It is worth pointing out here that both [37], [38] also establish that Alltop Gabor
frames can recover most $k$-sparse signals, albeit using BP, even when $k \gtrsim \sqrt{n}$. Nevertheless, the basic difference
between [37], [38] and the work presented here is that [37], [38] require the phases of the nonzero entries of $\beta$ to
be statistically independent and uniformly distributed on the unit torus whereas we do not assume any statistical
prior on the values of the nonzero entries of $\beta$. Note in particular that, just like the lasso result in [20], the results
reported in [37], [38] for Alltop Gabor frames consequently do not provide any guarantees beyond $k \gtrsim \sqrt{n}$ for
even the simple case of $\beta \in \mathbb{R}^p_+$. This difference between the BP-based recovery guarantees presented in [37],
[38] (which are essentially based on [39]) and the OST-based recovery guarantees provided in this paper is also
illustrated using a Venn diagram in Fig. 1 for unimodal signals (defined as: $|\beta_i| \approx c$ for some arbitrary $c > 0$ and
for all $i \in S$).

Fig. 1. A Venn diagram used to illustrate the major difference between the BP-based recovery guarantees and the OST-based recovery guarantees
for $k$-sparse unimodal signals in $\mathbb{R}^p_+$ measured using Alltop Gabor frames. In the diagram, $\Sigma_1$ is the space of all $k$-sparse unimodal
signals in $\mathbb{R}^p_+$ such that $k \lesssim n/\log n$, $\Sigma_2$ is the set of $k$-sparse signals in $\Sigma_1$ for which $k \lesssim \sqrt{n}$, and $\Sigma_3$ is the set of
signals in $\Sigma_1 - \Sigma_2$ that are supported on "bad" subsets. The OST algorithm is guaranteed to recover $\beta \in \Sigma_1 - \Sigma_3$. But
BP, unlike OST, is only guaranteed to recover $\beta \in \Sigma_2$ in this case.

D. Notation 

The following notation is used throughout the rest of this paper. We use lowercase letters to denote scalars and
vectors, while we use uppercase letters to denote matrices. We also use $0$, $\mathbf{1}$, and $I$ to denote the all-zeros vector,
the all-ones vector, and the identity matrix, respectively. In addition, we use $\|v\|_p$ to denote the usual $\ell_p$-norm of
a vector $v$, while we use $A^{\dagger}$, $\|A\|_2$, and $\|A\|_{\max}$ to denote the Moore-Penrose pseudoinverse, the spectral norm,
and the maximum magnitude of any entry of a matrix $A$, respectively. Further, we use $(\cdot)^T$ and $(\cdot)^H$ to denote
the operations of transposition and conjugate transposition, respectively, while we use $\langle \cdot, \cdot \rangle$ to denote the inner
product that is conjugate linear in the first argument. Finally, given a set $\mathcal{I}$, we use $v_{\mathcal{I}}$ to denote the part of a
vector $v$ corresponding to the indices in $\mathcal{I}$ and $A_{\mathcal{I}}$ to denote the submatrix obtained by collecting the $|\mathcal{I}|$ columns
of a matrix $A$ corresponding to the indices in $\mathcal{I}$.

E. Organization 

The rest of this paper is organized as follows. In Section II, we propose a model-order agnostic threshold for the
OST algorithm and characterize both the exact and the partial model-selection performance of OST. In Section III,
we extend our results on model selection and characterize the sparse-signal recovery performance of OST. In
Section IV, we specialize the model-selection and the sparse-signal recovery results of the previous sections to
Gabor frames. Finally, we provide proofs of the main results of this paper in Section V and conclude with a
discussion of the limitations and extensions of our results in Section VI.
II. Model Selection Using One-Step Thresholding

A. Assumptions 

Before proceeding with presenting our results on model selection using OST, we need to be mathematically precise
about our problem formulation. To this end, we begin by reconsidering the measurement model $y = X\beta + \eta$ and
assume that $X$ is an $n \times p$ real- or complex-valued design matrix having unit $\ell_2$-norm columns, $\beta \in \mathbb{C}^p$ is a
$k$-sparse signal ($\|\beta\|_0 \le k$), and $k < n \le p$. Here, we allow $X$ to be either a random or a deterministic design
matrix, while we take $\eta$ to be a complex additive white Gaussian noise vector. It is worth mentioning here though
that Gaussianity of $\eta$ is just a simplifying assumption for the sake of this exposition; in particular, the results
presented in this section are readily generalizable to other noise distributions as well as perturbations having bounded
$\ell_2$-norms. Finally, the main assumption that we make here is that the true model
$S := \{i \in \{1, \dots, p\} : |\beta_i| > 0\}$ is a uniformly random $k$-subset of $\{1, \dots, p\}$. In other words, we have a
uniform prior on the support of the data vector $\beta$.
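
A synthetic problem instance satisfying these assumptions can be generated as in the sketch below. It assumes NumPy; the name `make_instance`, the complex Gaussian design, and the Gaussian nonzero values are illustrative choices only, since the analysis allows arbitrary designs and arbitrary nonzero entries.

```python
import numpy as np

def make_instance(n, p, k, sigma, rng):
    """Draw (X, beta, y, S) with unit-norm columns and a uniformly random k-subset S."""
    X = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
    X /= np.linalg.norm(X, axis=0)                     # unit l2-norm columns
    S = rng.choice(p, size=k, replace=False)           # uniformly random k-subset of indices
    beta = np.zeros(p, dtype=complex)
    beta[S] = rng.standard_normal(k) + 1j * rng.standard_normal(k)   # illustrative nonzeros
    # CN(0, sigma^2 I): real and imaginary parts each carry variance sigma^2 / 2.
    eta = sigma * (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
    y = X @ beta + eta
    return X, beta, y, set(S.tolist())

rng = np.random.default_rng(0)
X, beta, y, S = make_instance(n=16, p=32, k=3, sigma=0.1, rng=rng)
```

Only the support is random in the model above; the values placed on it could equally be fixed by hand without changing the setup.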

B. Main Results 

Intuitively speaking, successful model selection requires the columns of the design matrix to be incoherent. In 
the case of the lasso, this notion of incoherence has been quantified in [18] and [19] in terms of the "irrepresentable
condition" and the "incoherence condition," respectively (see also [20]). In contrast to earlier work on model
selection, however, we formulate this idea of incoherence in terms of the coherence property. 

Definition 1 (The Coherence Property). An $n \times p$ design matrix $X$ having unit $\ell_2$-norm columns is said to obey
the coherence property if the following two conditions hold:

$\mu(X) \le \frac{0.1}{\sqrt{2 \log p}}$, and   (CP-1)

$\nu(X) \le \frac{\mu}{\sqrt{n}}$.   (CP-2)

In words, (CP-1) roughly states that the columns of $X$ are not too similar, while (CP-2) roughly states that the
columns of $X$ are somewhat distributed within the $n$-dimensional unit ball. Note that the coherence property
is superior to other measures of incoherence such as the irrepresentable condition in two key aspects. First, it
does not require the singular values of the submatrices of $X$ to be bounded away from zero. Second, it can be
easily verified in polynomial time since it simply requires checking that $\|X^H X - I\|_{\max} \le (200 \log p)^{-1/2}$ and

$\|(X^H X - I)\mathbf{1}\|_{\infty} \le (p - 1)\, n^{-1/2}\, \|X^H X - I\|_{\max}$.
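
The two checks amount to a few lines of linear algebra. The sketch below assumes NumPy and a matrix whose columns already have unit norm; the function name `obeys_coherence_property` is hypothetical.

```python
import numpy as np

def obeys_coherence_property(X):
    """Check (CP-1) and (CP-2) for an n x p matrix with unit l2-norm columns (p >= 2)."""
    n, p = X.shape
    G = X.conj().T @ X - np.eye(p)                    # X^H X - I
    mu = np.abs(G).max()                              # worst-case coherence
    nu = np.abs(G.sum(axis=1)).max() / (p - 1)        # average coherence
    cp1 = mu <= 0.1 / np.sqrt(2.0 * np.log(p))        # same as mu <= (200 log p)^(-1/2)
    cp2 = nu <= mu / np.sqrt(n)
    return bool(cp1 and cp2)

ok_orthonormal = obeys_coherence_property(np.eye(8))
ok_duplicate = obeys_coherence_property(np.array([[1.0, 1.0, 0.0],
                                                  [0.0, 0.0, 1.0]]))
```

The Gram matrix costs $O(np^2)$ operations, which is polynomial in the problem dimensions as claimed; the duplicated-column example fails (CP-1) because its worst-case coherence equals one.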

Below, we describe the implications of the coherence property for both the exact and the partial model-selection 
performance of OST. Before proceeding further, however, it is instructive to first define some fundamental quantities 
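pertaining to the problem of model selection as follows:

$\beta_{\min} := \min_{i \in S} |\beta_i|$,   $\mathrm{MAR} := \frac{\beta_{\min}^2}{\|\beta\|_2^2/k}$,

$\mathrm{SNR}_{\min} := \frac{\beta_{\min}^2}{\mathbb{E}[\|\eta\|_2^2]/k}$,   $\mathrm{SNR} := \frac{\|\beta\|_2^2}{\mathbb{E}[\|\eta\|_2^2]}$.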
In words, $\beta_{\min}$ is the magnitude of the smallest nonzero entry of $\beta$, while MAR (which is termed as
minimum-to-average ratio [23]) is the ratio of the energy in the smallest nonzero entry of $\beta$ and the average signal
energy per nonzero entry of $\beta$. Likewise, $\mathrm{SNR}_{\min}$ is the ratio of the energy in the smallest nonzero entry of $\beta$
and the average noise energy per nonzero entry, while SNR simply denotes the usual signal-to-noise ratio in the system.
It is also worth pointing out here the relationship between $\mathrm{SNR}_{\min}$ and SNR and MAR; specifically, it is easy to
see that $\mathrm{SNR}_{\min} = \mathrm{SNR} \cdot \mathrm{MAR}$. We are now ready to state the first main result of this paper that concerns
the performance of OST in terms of exact model selection.

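Theorem 1 (Exact Model Selection Using OST). Suppose that the design matrix $X$ satisfies the coherence property
and let $\eta$ be distributed as $\mathcal{CN}(0, \sigma^2 I)$. Next, choose the threshold
$\lambda = \max\Big\{ \frac{1}{t} 10 \mu \sqrt{n \cdot \mathrm{SNR}},\ \frac{1}{1-t} \sqrt{2} \Big\} \sqrt{2 \sigma^2 \log p}$
for any $t \in (0,1)$. Then, if we write $\mu(X)$ as $\mu = c_1 n^{-1/\gamma}$ for some $c_1 > 0$ (which may depend on $p$) and
$\gamma \in \{0\} \cup [2, \infty)$, the OST algorithm (Algorithm 1) satisfies $\Pr(\hat{S} \neq S) \le 6p^{-1}$ provided $p \ge 128$ and
the number of measurements satisfies

$n > \max\Big\{ 2k \log p,\ 8(1-t)^{-2} \frac{2k \log p}{\mathrm{SNR}_{\min}},\ \Big( c_2 t^{-2} \frac{2k \log p}{\mathrm{MAR}} \Big)^{\gamma/2} \Big\}$

$\phantom{n} = \max\Big\{ 2k \log p,\ 8(1-t)^{-2} \frac{2k \log p}{\mathrm{SNR} \cdot \mathrm{MAR}},\ \Big( c_2 t^{-2} \frac{2k \log p}{\mathrm{MAR}} \Big)^{\gamma/2} \Big\}$.   (1)

Here, the quantity $c_2 > 0$ is defined as $c_2 := (20 c_1)^2$, while the probability of failure is with respect to the true
model $S$ and the complex Gaussian noise vector $\eta$.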
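
The quantities entering Theorem 1 are simple to evaluate numerically. The sketch below assumes NumPy and noise distributed as $\mathcal{CN}(0, \sigma^2 I)$, so that $\mathbb{E}[\|\eta\|_2^2] = n\sigma^2$; the function names `signal_stats` and `ost_threshold` are hypothetical, and $k$ is taken to be the exact number of nonzero entries.

```python
import numpy as np

def signal_stats(beta, n, sigma):
    """Return (MAR, SNR_min, SNR) for noise distributed as CN(0, sigma^2 I) in C^n."""
    energies = np.abs(beta[beta != 0]) ** 2      # energies of the nonzero entries
    k = energies.size                            # assumption: k = number of nonzeros
    signal = energies.sum()                      # ||beta||_2^2
    noise = n * sigma ** 2                       # E||eta||_2^2
    mar = energies.min() / (signal / k)
    snr_min = energies.min() / (noise / k)
    snr = signal / noise
    return mar, snr_min, snr

def ost_threshold(mu, n, p, snr, sigma, t=0.5):
    """lambda = max{10 mu sqrt(n SNR) / t, sqrt(2) / (1 - t)} * sqrt(2 sigma^2 log p)."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie in (0, 1)")
    scale = np.sqrt(2.0 * sigma ** 2 * np.log(p))
    return max(10.0 * mu * np.sqrt(n * snr) / t, np.sqrt(2.0) / (1.0 - t)) * scale

beta = np.array([0, 1.0, 0, 2.0, 0, 2.0])
mar, snr_min, snr = signal_stats(beta, n=4, sigma=0.5)      # SNR_min equals SNR * MAR
lam = ost_threshold(mu=0.0, n=128, p=128, snr=1.0, sigma=1.0, t=0.5)
```

Note that the threshold uses only $\mu$, the SNR, the noise variance, and the dimensions; the model order $k$ never enters.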

The proof of this theorem is provided in Section V. Note that the parameter $t$ in Theorem 1 can always be
fixed a priori (say $t = 1/2$) without affecting the scaling relation in (1). In practice, however, $t$ should be chosen
so as to reduce the total number of measurements needed to ensure successful model selection; the optimal choice
of $t$ in this regard is
$t_{\mathrm{opt}} = \arg\min_{t \in (0,1)} \max\Big\{ 8(1-t)^{-2} \frac{2k \log p}{\mathrm{SNR}_{\min}},\ \Big( c_2 t^{-2} \frac{2k \log p}{\mathrm{MAR}} \Big)^{\gamma/2} \Big\}$.
Notice also that Theorem 1
is best suited for applications where one is interested in quantifying the minimum number of measurements needed
to guarantee exact model selection for a given class of signals. Alternatively, it might be the case in some other
applications that the problem dimensions are fixed and one is instead interested in specifying the class of signals
that leads to successful model selection. The following variant of Theorem 1 is best suited in such situations.

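Theorem 2. Suppose that the design matrix $X$ satisfies the coherence property and let the noise vector $\eta$ be
distributed as $\mathcal{CN}(0, \sigma^2 I)$. Next, let $p \ge 128$ and choose the threshold
$\lambda = \max\Big\{ \frac{1}{t} 10 \mu \sqrt{n \cdot \mathrm{SNR}},\ \frac{1}{1-t} \sqrt{2} \Big\} \sqrt{2 \sigma^2 \log p}$
for any $t \in (0, 1)$. Then the OST algorithm (Algorithm 1) satisfies $\Pr(\hat{S} \neq S) \le 6p^{-1}$ as long as we have that
$k \le n/(2 \log p)$ and

$\mathrm{MAR} > \max\Big\{ 8(1-t)^{-2} \Big( \frac{2k \log p}{n \cdot \mathrm{SNR}} \Big),\ 400\, t^{-2} \Big( \frac{2k \log p}{\mu^{-2}} \Big) \Big\}$.

Here, the probability of failure is with respect to the true model $S$ and the complex Gaussian noise vector $\eta$.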



Algorithm 3 The Sorted One-Step Thresholding (SOST) Algorithm for Model Selection

Input: An $n \times p$ matrix $X$, a vector $y \in \mathbb{C}^n$, and the model order $k$
Output: An estimate $\hat{S} \subset \{1, \dots, p\}$ of the true model $S$
$f \leftarrow X^H y$   {Form signal proxy}
$(i_1, \dots, i_p) \leftarrow$ indices of $f$ sorted so that $|f_{i_1}| \ge |f_{i_2}| \ge \dots \ge |f_{i_p}|$   {Sort the signal proxy}
$\hat{S} \leftarrow \{i_1, \dots, i_k\}$   {Select model via OST}

Note that the proof of Theorem 2 follows directly from the proof of Theorem 1. There are a few important
remarks that need to be made at this point concerning the threshold proposed in Theorem 1 and Theorem 2 for the
OST algorithm. First, it is easy to see that the proposed threshold is completely agnostic to the model order $k$ and
only requires knowledge of the SNR and the noise variance. Second, some of the bounds in the proof of Theorem 1
and extensive simulations suggest that the absolute constant 10 in the proposed threshold is somewhat conservative
and can be reduced through the use of more sophisticated analytical tools (also see Section VI). Finally, while
estimating the true model order $k$ tends to be harder than estimating the SNR and the noise variance in the majority
of situations, it might be the case that estimating $k$ is easier in some applications. It is better in such situations
to work with a slight variant of the OST algorithm (see Algorithm 3) that relies on knowledge of the model order
$k$ instead and returns an estimate $\hat{S}$ corresponding to the $k$ largest (in magnitude) entries of $X^H y$. We characterize
the performance of this algorithm, which we term the sorted one-step thresholding (SOST) algorithm, in terms of
the following theorem.

Theorem 3 (Exact Model Selection Using SOST). Suppose that the design matrix $X$ satisfies the coherence property
and let $\eta$ be distributed as $\mathcal{CN}(0, \sigma^2 I)$. Next, write $\mu(X)$ as $\mu = c_1 n^{-1/\gamma}$ for some $c_1 > 0$ (which may depend on
$p$) and $\gamma \in \{0\} \cup [2, \infty)$. Then the SOST algorithm (Algorithm 3) satisfies $\Pr(\hat{S} \neq S) \le 6p^{-1}$ as long as $p \ge 128$
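and the number of measurements satisfies

$n > \min_{t \in (0,1)} \max\Big\{ 2k \log p,\ 8(1-t)^{-2} \frac{2k \log p}{\mathrm{SNR}_{\min}},\ \Big( c_2 t^{-2} \frac{2k \log p}{\mathrm{MAR}} \Big)^{\gamma/2} \Big\}$.   (2)
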
Here, the quantity $c_2 > 0$ is as defined in Theorem 1, while the probability of failure is with respect to the true
model $S$ and the complex Gaussian noise vector $\eta$.
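
In code, SOST replaces the threshold comparison by a sort of the proxy magnitudes. A minimal sketch, assuming NumPy; the name `sost_select` and the toy data are hypothetical.

```python
import numpy as np

def sost_select(X, y, k):
    """Sorted OST: indices of the k largest-magnitude entries of X^H y."""
    f = X.conj().T @ y                                  # signal proxy
    order = np.argsort(-np.abs(f), kind="stable")       # indices by decreasing |f_i|
    return set(order[:k].tolist())

X = np.eye(5, dtype=complex)
beta = np.array([0, 2.0, 0, -1.5j, 0])
y = X @ beta
S_hat = sost_select(X, y, k=2)
```

A full sort is used here for clarity; a partial selection of the $k$ largest entries would serve equally well.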

The proof of this theorem is just a slight variant of the proof of Theorem 1 and is therefore omitted here. A few
remarks are in order now concerning OST and SOST. First, the computational complexity of SOST is comparable
with that of OST since efficient sorting algorithms (such as heap sort) tend to have computational complexity of
$O(p \log p)$ only. Second, (1) and (2) suggest that knowledge of the true model order $k$ allows SOST to perform
better than OST in situations where the threshold parameter $t$ is fixed a priori (cf. Theorem 1). In this sense, SOST
should be preferred over OST for exact model selection provided one has accurate knowledge of the true model
order $k$. On the other hand, OST should be the algorithm of choice for model-selection problems where it is difficult
to obtain a reliable estimate of the true model order $k$. We conclude this discussion by rephrasing Theorem 3 for
SOST along the lines of Theorem 2 for OST.

Theorem 4. Suppose that the design matrix $X$ satisfies the coherence property. Next, let $p \ge 128$ and let the
noise vector $\eta$ be distributed as $\mathcal{CN}(0, \sigma^2 I)$. Then the SOST algorithm (Algorithm 3) satisfies $\Pr(\hat{S} \neq S) \le 6p^{-1}$
provided $k \le n/(2 \log p)$ and

$\mathrm{MAR} > \min_{t \in (0,1)} \max\Big\{ 8(1-t)^{-2} \Big( \frac{2k \log p}{n \cdot \mathrm{SNR}} \Big),\ 400\, t^{-2} \Big( \frac{2k \log p}{\mu^{-2}} \Big) \Big\}$.

Here, the probability of failure is with respect to the true model $S$ and the complex Gaussian noise vector $\eta$.

The final result that we present in this section concerns the partial model-selection performance of OST.
Specifically, note that our focus in this section has so far been on specifying conditions for either the number of
measurements or the MAR of the signal that ensure exact model selection. In many real-world applications, however,
the parameters of the problem are fixed and it is not always possible to ensure that either the number of measurements
or the MAR of the signal satisfies the aforementioned conditions. A natural question to ask then is whether the OST
algorithm completely fails in such circumstances or whether any guarantees can still be provided for its performance.
We address this aspect of the OST algorithm in the following and show that, even if the MAR of $\beta$ is very small,
OST has the ability to identify the locations of the nonzero entries of $\beta$ whose energies are greater than both the
noise power and the average signal energy per nonzero entry. In order to make this notion mathematically precise,
we first define the $m$-th largest-to-average ratio ($\mathrm{LAR}_m$) of $\beta$ as the ratio of the energy in the $m$-th largest (in
magnitude) nonzero entry of $\beta$ and the average signal energy per nonzero entry of $\beta$; that is,

$\mathrm{LAR}_m := \frac{|\beta_{(m)}|^2}{\|\beta\|_2^2/k}$,

where $\beta_{(m)}$ denotes the $m$-th largest (in magnitude) nonzero entry of $\beta$ (note that $\mathrm{MAR} = \mathrm{LAR}_k$). We are now ready
to specify the partial model-selection performance of the OST algorithm.

Theorem 5 (Partial Model Selection Using OST). Suppose that the design matrix $X$ satisfies the coherence
property. Next, let $p \ge 128$ and let $\eta$ be distributed as $\mathcal{CN}(0, \sigma^2 I)$. Finally, fix a parameter $t \in (0, 1)$ and choose the
threshold $\lambda = \max\left\{ \frac{1}{t} 10\mu\sqrt{n \cdot \mathrm{SNR}},\; \frac{1}{1-t}\sqrt{2} \right\} \sqrt{2\sigma^2 \log p}$. Then, under the assumption that $k \le n/(2\log p)$, the
OST algorithm (Algorithm 1) guarantees with probability exceeding $1 - 6p^{-1}$ that $\hat{S} \subset S$ and $|S - \hat{S}| \le (k - M)$,
where $M$ is the largest integer for which the following inequality holds:

Here, the probability of failure is with respect to the true model $S$ and the complex Gaussian noise vector $\eta$.
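
As a small worked illustration of the largest-to-average ratio defined before the theorem, the sketch below computes all $\mathrm{LAR}_m$ values of a toy vector. The helper name `lar` and the example vector are assumptions made purely for illustration.

```python
import numpy as np

def lar(beta):
    """Return [LAR_1, ..., LAR_k] for the k nonzero entries of beta."""
    energies = np.abs(beta[beta != 0]) ** 2      # energy of each nonzero entry
    k = energies.size
    avg_energy = energies.sum() / k              # ||beta||_2^2 / k
    return np.sort(energies)[::-1] / avg_energy  # sorted from largest to smallest

# Toy vector (assumed): nonzero energies 9, 4, 1, so the average energy is 14/3.
beta = np.array([0.0, 3.0, 0.0, 1.0, 0.0, 2.0])
ratios = lar(beta)
mar = ratios[-1]   # MAR coincides with LAR_k, the smallest ratio
```

By construction the ratios sum to $k$, so a small MAR forces some other entries to have an LAR above one; these are the entries that partial model selection aims to retain.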

The proof of this theorem, which relies to a great extent on the proof of Theorem 1, is provided in Section V.
We conclude this section by pointing out that no counterpart of Theorem 5 exists for the SOST algorithm since we
can never have $\hat{S} \subset S$ in that case because of the nature of the algorithm.

C. Discussion

The results reported in this section can be best put into perspective by considering some specific model-selection 
problems that are commonly studied in the literature and juxtaposing our results with the ones reported in previous 
works. The rest of this section is devoted to such comparison purposes. 

1) Gaussian Design Matrices: Matrices with independent and identically distributed (i.i.d.) $\mathcal{N}(0, 1/n)$ entries
(i.e., Gaussian matrices) are perhaps the most widely assumed design matrices in the model-selection literature.
In order to specialize our results to Gaussian design matrices, we first need to specify the worst-case coherence
$\mu$ and the average coherence $\nu$ of i.i.d. Gaussian matrices. The first lemma that we have in this regard follows
immediately from Proposition 5 in Appendix A through a simple union bound argument.

Lemma 1 (Worst-Case Coherence of Gaussian Matrices). Let $X$ be an $n \times p$ design matrix with i.i.d. $\mathcal{N}(0, 1/n)$
entries. Then, as long as $n \ge 60\log p$, we have that $\mu(X) \le \sqrt{\frac{15\log p}{n}}$ with probability exceeding $1 - 2p^{-1}$.

Remark 1. A cautious reader might argue here that Lemma 1 only provides an upper bound on the worst-case
coherence of Gaussian design matrices. Nevertheless, the results (and the definition of the coherence property)
presented earlier in this section remain valid if one replaces $\mu(X)$ with an upper bound $\bar{\mu}(X)$ on $\mu(X)$.

Lemma 2 (Average Coherence of Gaussian Matrices). Let $X$ be an $n \times p$ design matrix with i.i.d. $\mathcal{N}(0, 1/n)$
entries. Then, as long as $p > n \ge 60\log p$, we have that $\nu(X) \le \frac{\sqrt{15\log p}}{n}$ with probability exceeding $1 - 2p^{-1}$.

Proof: The proof of this lemma is also a direct consequence of Proposition 5 in Appendix A. Specifically, fix
an index $i \in \{1, \dots, p\}$ and define $\tilde{x}_i \doteq \frac{1}{\sqrt{p-1}} \sum_{j \neq i} x_j$. Then it is easy to see that $\tilde{x}_i$ has i.i.d. $\mathcal{N}(0, 1/n)$ entries
and that it is independent of $x_i$. Therefore Proposition 5 in Appendix A implies through a simple union bound argument
that $\max_i |\langle x_i, \tilde{x}_i \rangle| \le \sqrt{\frac{15\log p}{n}}$ with probability exceeding $1 - 2p^{-1}$ as long as $n \ge 60\log p$. The proof of the
lemma now follows from the fact that $p > n$ and $\nu(X) = \frac{1}{\sqrt{p-1}} \max_i |\langle x_i, \tilde{x}_i \rangle|$. ■
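
The two lemmas can be checked numerically on a single random draw. The sketch below is illustrative only: the helper name `coherences`, the dimensions, and the seed are assumptions, and the average coherence is computed as $\nu(X) = \frac{1}{p-1}\max_i |\sum_{j \neq i} \langle x_i, x_j \rangle|$, which is the form used in the proof of Lemma 2.

```python
import numpy as np

def coherences(X):
    """Return (worst-case coherence mu, average coherence nu) of the columns of X."""
    G = X.conj().T @ X                       # Gram matrix of inner products
    p = G.shape[0]
    off = G - np.diag(np.diag(G))            # zero out the diagonal terms
    mu = np.abs(off).max()                   # max_{i != j} |<x_i, x_j>|
    nu = np.abs(off.sum(axis=1)).max() / (p - 1)
    return mu, nu

# One Gaussian draw with i.i.d. N(0, 1/n) entries (dimensions are assumed).
rng = np.random.default_rng(0)
n, p = 400, 512                              # satisfies p > n >= 60 log p
X = rng.standard_normal((n, p)) / np.sqrt(n)
mu, nu = coherences(X)

mu_bound = np.sqrt(15 * np.log(p) / n)       # bound of Lemma 1
nu_bound = np.sqrt(15 * np.log(p)) / n       # bound of Lemma 2
```

On typical draws both quantities sit well below the stated bounds, and $\nu$ is far smaller than $\mu$ because the off-diagonal inner products largely cancel when summed.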

Lemma 1 and Lemma 2 establish that Gaussian design matrices satisfy the coherence property with high
probability as long as $n \gtrsim (\log p)^2$. Theorem 1 (resp. Theorem 3) therefore implies that OST (resp. SOST) correctly
identifies the exact model with probability exceeding $1 - O(p^{-1})$ as long as $n \gtrsim \max\left\{1, \frac{1}{\mathrm{SNR} \cdot \mathrm{MAR}}, \frac{1}{\mathrm{MAR}}\right\} k\log p$. In
particular, this suggests that if either $\mathrm{MAR}(\beta) = \Theta(1)$ or $\mathrm{SNR} = O(1)$ then OST leads to successful model selection
with high probability provided $n \gtrsim \max\left\{1, \frac{1}{\mathrm{SNR} \cdot \mathrm{MAR}}\right\} k\log p$. On the other hand, one of the best known results for
model selection using the maximum likelihood algorithm requires that n >^ max | ^snr^mar*^"* ' ^ ^"-"S (p/^) | EQ) (also 
see 1231 , BTl ). This establishes that OST (and its variants) performs near-optimally for Gaussian design matrices 
provided (i) the SNR in the measurement system is not too high or (ii) the energy of any nonzero entry of $\beta$ is not
too far away from the average energy $\|\beta\|_2^2/k$ and $k$ scales sublinearly with $p$.

Remark 2. It is worth pointing out here that somewhat similar results can also be obtained for sub-Gaussian design
matrices (i.e., matrices with entries given by i.i.d. bounded random variables, etc.) using standard concentration
inequalities. Note also that the preceding discussion regarding Gaussian design matrices strengthens the results of
Fletcher et al. [23] concerning asymptotic (Gaussian) model selection using thresholding (cf. [23, Theorem 2]).

Footnote: Here, and throughout the rest of this paper, we use the shorthand notation $f(n) \gtrsim g(n)$ (resp. $f(n) \lesssim g(n)$) to indicate that $f(n) \succeq g(n)$
(resp. $f(n) \preceq g(n)$) modulo a logarithmic factor.

2) Lasso versus OST: Historically, OST (and its variants) is preferred over the lasso because of its low computational
complexity. The results reported in this paper, however, bring forth another important aspect of OST (also
see MIOII ): OST can lead to successful model selection even when the lasso fails. Specifically, note that the lasso 
solution is not even guaranteed to be unique if the minimum singular value of the submatrix of X corresponding to 
the true model is not bounded away from zero (see, e.g., ifTSl . |fT9l ). On the other hand, OST does not require the 
aforementioned condition for model selection. Note that this is in part due to the fact that model selection using 
the lasso is in fact a byproduct of signal reconstruction, whereas the aforementioned OST results do not guarantee 
signal reconstruction without imposing additional constraints on X. In other words, we have established in the 
paper that model selection is inherently an easier problem than signal reconstruction. 

Finally, it is worth comparing the model-selection performance of OST with that of the lasso for the cases when
the lasso does succeed. In this regard, the most general result for model selection using the lasso states that if
$X$ is close to being a tight frame in the sense that $\|X\|_2 \approx \sqrt{p/n}$ then the lasso identifies the correct model
with probability exceeding $1 - O(p^{-1})$ as long as (i) the nonzero entries of $\beta$ are independent and statistically
symmetric around zero, (ii) fc n/logp, and (iii) MAR ^ ||20l Theorem 1.3]. On the other hand, assume 

now that the design matrix $X$ has $\mu(X) \asymp n^{-1/2}$ and $\nu(X) \lesssim n^{-1}$; there indeed exist design matrices that satisfy
these conditions (e.g., Gaussian matrices, as proved earlier, and Alltop Gabor frames, as proved in Section IV). We
then have from Theorem 2 (resp. Theorem 4) that OST (resp. SOST) identifies the correct model with probability
exceeding 1 — 0{p^^) as long as k ;< n/\ogp and MAR ^ max|^, 1 1 ■ This suggests that, even for the 
cases in which the lasso succeeds, OST can be guaranteed to perform as well as the lasso in situations where either
the energy of any nonzero entry of $\beta$ is not too far away from the average energy ($\mathrm{MAR} = \Theta(1)$) or the SNR
is not too high ($\mathrm{SNR} = O(1)$). Equally importantly, and in contrast to the lasso results reported in [20], OST is
guaranteed to attain this performance irrespective of the values of the nonzero entries of the data vector $\beta$.

3) Near-Optimality of OST: We have concluded up to this point that, under certain conditions on MAR and
SNR, the OST algorithm can perform as well as the lasso and it performs near-optimally for Gaussian design
matrices. We conclude this discussion by arguing that the OST algorithm in fact performs near-optimally for any
design matrix that satisfies $\mu(X) \asymp n^{-1/2}$ and $\nu(X) \lesssim n^{-1}$ as long as $\mathrm{MAR} = \Theta(1)$ or $\mathrm{SNR} = O(1)$. In order to
accomplish this goal, we first recall the thresholding results obtained by Donoho and Johnstone [25], which form the
basis of ideas such as wavelet denoising, for the case of $p \times p$ orthonormal design matrices. Specifically, it was
established in [25] that if $X$ is an orthonormal basis then hard thresholding the entries of $X^H y$ at $\lambda \asymp \sqrt{\sigma^2 \log p}$
results in oracle-like performance in the sense that one recovers (with high probability) the locations of all the
nonzero entries of $\beta$ that are above the noise floor.

Footnote: Note that it trivially follows from the Welch bound [36] that there exists no design matrix with $p \gg 1$ that satisfies $\mu(X) \asymp n^{-1/\gamma}$ with
$\gamma < 2$. On the other hand, there does exist a large body of literature devoted to constructing matrices with $\mu(X) \asymp n^{-1/2}$ [42].

Now the first thing to note regarding the results presented earlier in this section is the intuitively pleasing nature
of the threshold proposed for the OST algorithm. Specifically, assume that $X$ is an orthonormal design and notice
that, since $\mu(X) = 0$ in this case, the threshold $\lambda \asymp \max\left\{\mu\sqrt{n \cdot \mathrm{SNR}}, 1\right\}\sqrt{\sigma^2 \log p}$ proposed earlier reduces to
the threshold proposed in [25] and Theorem 5 guarantees that thresholding recovers (with high probability) the
locations of all the nonzero entries of fi that are above the noise floor: LAR„i >3 m ^ S. Now consider 

instead design matrices that are not necessarily orthonormal but which satisfy $\mu(X) \asymp n^{-1/2}$ and $\nu(X) \lesssim n^{-1}$.

Then we have from Theorem 5 that OST identifies (with high probability) the locations of the nonzero entries
of $\beta$ whose energies are greater than both the noise power and the average signal energy per nonzero entry:
LARm ^ max | , 1 1 ^ ^ m G S. In particular, under the assumption that either MAR = 0(1) (and 

since $\mathrm{MAR} \le \mathrm{LAR}_m$) or $\mathrm{SNR} = O(1)$, this suggests that OST in such situations performs in a near-optimal
(oracle-like) fashion in the sense that it recovers (with high probability) the locations of all the nonzero entries of
$\beta$ that are above the noise floor without requiring the design matrix $X$ to be an orthonormal basis.

III. Recovery of Sparse Signals Using One-Step Thresholding 

In this section, we extend our results on model selection using OST to model-order agnostic recovery of $k$-sparse
signals. In doing so, we also strengthen the results of Schnass and Vandergheynst [22] for signal recovery using
thresholding in at least three key aspects. First, we specify polynomial-time verifiable sufficient conditions under
which recovery of $k$-sparse signals using OST succeeds. Second, the threshold that we specify for the OST algorithm
(Algorithm 2) does not require knowledge of the model order $k$. Third, we do not impose a statistical prior on the
nonzero entries of the data vector $\beta$. Note that, just like [22], we limit ourselves in this exposition to recovery of
$k$-sparse signals in a noiseless setting; extensions of these results to noisy settings will be reported in a sequel
to this paper. In other words, the measurement model that we study in this section is $y = X\beta$ and the goal is to
recover the $k$-sparse $\beta$ using OST under the assumption that the true model $S \doteq \{i \in \{1, \dots, p\} : |\beta_i| > 0\}$ is a
uniformly random $k$-subset of $\{1, \dots, p\}$.

A. Main Result 

Intuitively speaking (and as noted in the discussion in Section II), the problem of sparse-signal recovery is
inherently more difficult than the problem of model selection. We capture part of this intuitive notion in the
following in terms of the strong coherence property.

Definition 2 (The Strong Coherence Property). An $n \times p$ design matrix $X$ having unit $\ell_2$-norm columns is said to
obey the strong coherence property if the following two conditions hold:

1^{X)<—^^ , and (SCP-1) 

60e logp 

uiX)<-^. (SCP-2)

In order to better illustrate the difference between the coherence property and the strong coherence property, note
that we have from Lemma 1 and Lemma 2 that Gaussian design matrices satisfy the coherence property with high
probability as long as n ^ (logp)^. On the other hand. Lemma [T] and Lemma |2] suggest that Gaussian design 
matrices satisfy the strong coherence property with high probability as long as n ^ (logp)'*. In other words, there 
are scaling regimes in which Gaussian design matrices satisfy the coherence property but are not guaranteed to 
satisfy the strong coherence property. We are now ready to state the main result of this section that makes use of 
the notation developed earlier in Section of the paper 



Theorem 6 (Sparse-Signal Recovery Using OST). Suppose that the design matrix X satisfies the strong coherence 
property and choose the threshold A = 1 
satisfies Pr(/3 7^ /3) < Qp^^ as long as 



property and choose the threshold A = 10/i||yj|2A/ .^'"ji^ for any p > 128. Then the OST algorithm (Algorithm^ 



1-0-1/2 



^ ^ • j P M 'MAR 1 

fc<min< 2l|yil2i '^1 (■ 

Here, the probability of failure is only with respect to the true model $S$ (locations of the nonzero entries of $\beta$),
while $c_3$, $c_4$ are positive numerical constants given by $c_3 = 37e$ and $c_4 = 43$.
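
The recovery procedure analyzed in this section can be sketched as follows. This is a minimal sketch under stated assumptions rather than the paper's Algorithm 2: it assumes that recovery consists of thresholding the correlations and then solving a least-squares problem restricted to the selected columns, and the function name `ost_recover`, the orthonormal toy design, and the threshold value are all hypothetical choices for the example.

```python
import numpy as np

def ost_recover(X, y, lam):
    """Threshold X^H y at lam, then least-squares fit on the selected columns."""
    f = X.conj().T @ y
    S_hat = np.flatnonzero(np.abs(f) > lam)          # estimated model
    beta_hat = np.zeros(X.shape[1], dtype=complex)
    if S_hat.size:
        # Assumption: recovery step is a least-squares fit on the chosen columns.
        coef, *_ = np.linalg.lstsq(X[:, S_hat], y, rcond=None)
        beta_hat[S_hat] = coef
    return beta_hat, S_hat

# Noiseless toy example with an orthonormal design (assumed for simplicity).
rng = np.random.default_rng(2)
p = 32
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
Q = Q.astype(complex)
beta = np.zeros(p, dtype=complex)
beta[[4, 9, 21]] = [1.0, -2.0, 1.5j]
y = Q @ beta                                         # measurement model y = X beta

beta_hat, S_hat = ost_recover(Q, y, lam=0.5)
```

Note that the threshold is passed in directly and no model order is needed, which mirrors the model-order agnostic nature of the result; the specific threshold of the theorem is not reproduced here.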

The significance of this theorem can be best put into perspective by considering the case of the design matrix
$X$ being an approximately tight frame in the sense that $\|X\|_2 \approx \sqrt{p/n}$; indeed, we have that Gaussian design
matrices satisfy this condition with high probability [43] and that Gabor frames generated from any (unit-norm)
nonzero vector satisfy $\|X\|_2 = \sqrt{p/n}$ $(= \sqrt{n})$ [44]. It then follows from Theorem 6 that if $X$ satisfies the strong
coherence property then OST exactly recovers any $k$-sparse vector $\beta$ with high probability as long as $k \lesssim \mu^{-2}\mathrm{MAR}$;
in particular, if we assume that $\mathrm{MAR} = \Theta(1)$ then this condition reduces to $k \lesssim \mu^{-2}$. On the other hand,
low-complexity sparse-recovery algorithms such as subspace pursuit [29], CoSaMP [30], and iterative hard thresholding
[31] all rely on the restricted isometry property (RIP) [35]. Therefore, the guarantees provided in [29]-[31] for the
case of generic design matrices are limited to $k$-sparse signals that satisfy $k \lesssim \mu^{-1}$, which is much weaker than
the $k \lesssim \mu^{-2}$ scaling claimed here. We conclude this section by pointing out that if one does have knowledge of
the true model order then it can be shown through a slight variation of the proof of Theorem 6 that SOST (the
sorted variant of OST) can also recover sparse signals with high probability, the only difference in that case
being that the constant C4 in Theorem |6] gets replaced with a smaller constant C4 = -y/SOO. 

IV. Why Gabor Frames? 

Our focus in Section II and Section III has been on establishing that OST leads to successful model selection and
sparse-signal recovery under certain conditions on three global parameters of the design matrix $X$: $\mu(X)$, $\nu(X)$,
and $\|X\|_2$. As noted earlier, one particular class of design matrices that satisfies these conditions is the class of
random sub-Gaussian matrices. In contrast, our focus in this section is on establishing that Gabor frames, which
are collections of time- and frequency-shifts of a nonzero seed vector in $\mathbb{C}^n$, also tend to satisfy the aforementioned
conditions on the matrix geometry. Note that Gabor frames constitute an important class of design matrices because
of the facts that (i) Gabor frames are completely specified by a total of $n$ numbers that describe the seed vector, (ii)
multiplications with Gabor frames (and their adjoints) can be efficiently carried out using algorithms such as the fast
Fourier transform, (iii) Gabor frames arise naturally in many important application areas such as communications,
radar, and signal/image processing, and (iv) there exist deterministic constructions of Gabor frames that (as shown
next) are nearly optimal in terms of the requisite conditions on $\mu(X)$, $\nu(X)$, and $\|X\|_2$.

Footnote: Note that the $k \lesssim \mu^{-1}$ claim is an easy consequence of the Geršgorin circle theorem [45]; see, for example, [39], [46]-[48].



A. Geometry of Gabor Frames and Its Implications 

A (finite) frame for $\mathbb{C}^n$ is defined as any collection of $p \ge n$ vectors that span the $n$-dimensional Hilbert
space $\mathbb{C}^n$ [49]. Gabor frames for $\mathbb{C}^n$ constitute an important class of frames, having applications in areas such as
communications [50] and radar [38], that are constructed from time- and frequency-shifts of a nonzero seed vector
in $\mathbb{C}^n$. Specifically, let $g \in \mathbb{C}^n$ be a unit-norm seed vector and define $T$ to be an $n \times n$ time-shift matrix that is
generated from $g$ as follows

$$T(g) \doteq \begin{bmatrix} g_1 & g_n & \cdots & g_2 \\ g_2 & g_1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & g_n \\ g_n & g_{n-1} & \cdots & g_1 \end{bmatrix} \tag{7}$$

where we write $T = T(g)$ to emphasize that $T$ is a matrix-valued function on $\mathbb{C}^n$. Next, denote the collection of $n$
samples of a discrete sinusoid with frequency $2\pi\frac{m}{n}$, $m \in \{0, \dots, n-1\}$, as $\omega_m \doteq \left[ e^{j2\pi\frac{m}{n}0} \;\cdots\; e^{j2\pi\frac{m}{n}(n-1)} \right]^T$.
Finally, define the corresponding $n \times n$ diagonal modulation matrices as $W_m = \mathrm{diag}(\omega_m)$. Then the Gabor frame
generated from $g$ is an $n \times n^2$ block matrix of the form

$$X = \begin{bmatrix} W_0 T & W_1 T & \cdots & W_{n-1} T \end{bmatrix}. \tag{8}$$

In words, columns of the Gabor frame $X$ are given by downward circular shifts and modulations (frequency shifts)
of the seed vector $g$. We are now ready to state the first main result concerning the geometry of Gabor frames,
which follows directly from [44].

Proposition 1 (Spectral Norm of Gabor Frames [44]). Gabor frames generated from nonzero (unit-norm) seed
vectors are tight frames; in other words, we have that $\|X\|_2 = \sqrt{n}$.
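
The construction in (7) and (8) and the tight-frame property of Proposition 1 can be checked numerically. The sketch below is illustrative: the function name `gabor_frame`, the random seed vector, and the size $n = 8$ are assumptions; indices are zero-based, so `T[q, l] = g[(q - l) mod n]` reproduces the downward circular shifts of (7).

```python
import numpy as np

def gabor_frame(g):
    """Build the n x n^2 Gabor frame [W_0 T, W_1 T, ..., W_{n-1} T] from seed g."""
    g = np.asarray(g, dtype=complex)
    n = g.size
    q = np.arange(n)
    T = g[(q[:, None] - q[None, :]) % n]        # circulant time-shift matrix T(g)
    blocks = []
    for m in range(n):
        w = np.exp(2j * np.pi * m * q / n)      # samples of the m-th sinusoid
        blocks.append(w[:, None] * T)           # W_m T with W_m = diag(w)
    return np.hstack(blocks)

# Random unit-norm seed vector (an assumed example; any nonzero seed works).
rng = np.random.default_rng(1)
n = 8
g = rng.standard_normal(n) + 1j * rng.standard_normal(n)
g /= np.linalg.norm(g)

X = gabor_frame(g)
singular_values = np.linalg.svd(X, compute_uv=False)
column_norms = np.linalg.norm(X, axis=0)
```

All singular values coincide for a tight frame, so the spectral norm equals $\sqrt{p/n} = \sqrt{n}$ regardless of which unit-norm seed is used.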

Recall from Theorem 6 and the subsequent discussion in Section III that design matrices with small spectral
norms are particularly well-suited for recovery of $k$-sparse signals. In this regard, Proposition 1 implies that Gabor
frames are the best that one can hope for in terms of the spectral norm. The next result that we prove concerns the
average coherence of Gabor frames.

Theorem 7 (Average Coherence of Gabor Frames). Let $X$ be a Gabor frame generated from a unit-norm seed
vector $g \in \mathbb{C}^n$. Then, using the notation $g_{\max} \doteq \max_i |g_i|$ and $g_{\min} \doteq \min_i |g_i|$, the average coherence of $X$ can
be bounded from above as follows:

n 5max(\/?l - 5min) + I - U g^ 



v{X)< 



f ■ 



(9) 



Proof: In order to facilitate the proof of this theorem, we first map the indices of the columns of $X$ from
$\{1, \dots, n^2\}$ to $\mathcal{C} \doteq \{0, \dots, n-1\} \times \{0, \dots, n-1\}$ as follows

' (10) 



K : I i-> y{i mod n) — 1, 

In words, $\kappa(i) = (\ell, m)$ signifies that the $i$-th column of $X$ corresponds to the $(\ell + 1)$-th column of $W_m T$. Next,
fix an index $i$ (resp. $\kappa(i) = (\ell, m)$) and make use of the above reindexing to write

71—1 n— 1 n— 1 

5Z(x„(i),XK(j)) = ^ (x£,m,X£',m') = ^ ^ (xf ,Tn , X£',m' ) + ^ (x^,m , X^_,„/ ) . (11) 



(«',m')ec 



e'=0 m'=0 



m'=0 
rn 



Finally, note that we can explicitly write the columns of $X$ using (8) for any $(\ell, m) \in \mathcal{C}$ as follows

r 



(12) 



where we use the notation $g_{(q)_n}$ as a shorthand for $g_{q \bmod n}$.

The rest of the proof now follows from simple algebraic manipulations. Specifically, it is easy to see from (12)
that the first term in (11) can be simplified as



n— 1 n— 1 
£'=0 m'=0 



(m' — m) 



<J=1 f'=0 



m'=0 



n n— 1 n— 1 

2^ 3(g-f)„5(g-r)„ 2^ e-' " ^ ^ + 

q=2 e'=0 m'=0 

n— 1 n— 1 

+ E 3(i-«)-f(i-f')n = ".9(*i-f)„ E 9{i-e 



(13) 



£'=0 



where (a) in the above expression is a consequence of the fact that $\sum_{m'=0}^{n-1} e^{j2\pi\frac{(q-1)}{n}(m'-m)} = 0$ for any fixed

$q \in \{2, \dots, n\}$. Likewise, we can simplify the second term in (11) as follows



m'=0 9=1 m'=0 

n n—1 n— 1 

EU P oJ27razii (,„'_„!) I |2 

|5(9-£)„| 2^ e-' " ^ ^ + |<7(i-£)„| 2^ 1 

9=2 jn'=0 m'=0 

n 

(b) I |2 / N I 1 2 I I 

= ^ }^\9iq-e)^\ + (" - l)|5(i-f)„| = -l + "-|5(i-£)„| 

9=2 



(14)

where (b) follows from the fact that $\sum_{m' \neq m} e^{j2\pi\frac{(q-1)}{n}(m'-m)} = -1$ for any fixed $q \in \{2, \dots, n\}$.

To conclude the theorem, note from (11), (13), and (14) that we can write

n-1 

^(xj,Xj) = max n.g(*i_^)^ ^ .9(i-r)„ - 1 + '^|ff(i-f)„| 



max 

i6{l,...,n2} 



e'=o 



< max 

re{l,...,n} 



s^r 



<n max Iffriy^lffs 

r(^{l,...,n} 



s=l 



max 

re{l,...,n} 



^ 5min 



n\gr\ 



(15) 



Here, (c) mainly follows from the triangle inequality and a simple reindexing argument, while (d) mainly follows
from the Cauchy-Schwarz inequality since $\sum_{s=1, s \neq r}^{n} |g_s| = \|g\|_1 - |g_r| \le \sqrt{n} - g_{\min}$. The proof of the theorem now
follows by dividing the above expression by $n^2 - 1$. ■
In words, Theorem 7 states that the average coherence of Gabor frames cannot be too large. In particular, it
implies that Gabor frames generated from unimodal (unit-norm) seed vectors (i.e., seed vectors characterized by
$g_{\min} \asymp g_{\max} \asymp n^{-1/2}$) satisfy $\nu(X) \lesssim n^{-1}$. On the other hand, recall that the Welch bound [36] dictates that
$\mu(X) \ge (n + 1)^{-1/2}$ for Gabor frames. It is therefore easy to conclude from these two facts that Gabor frames
generated from unimodal seed vectors are automatically guaranteed to satisfy the coherence property (resp. strong
coherence property) as long as $\mu(X) \lesssim (\log p)^{-1/2}$ (resp. $\mu(X) \lesssim (\log p)^{-1}$). In the context of model selection
and sparse-signal recovery, Theorem 7 therefore suggests that Gabor frames generated from unimodal seed vectors
are the best that one can hope for in terms of the average coherence.

Finally, recall from the discussions in Section II and Section III that, among the class of matrices that satisfy
the (strong) coherence property — design matrices with small worst-case coherence are particularly well-suited for 
model selection and sparse-signal recovery. In the context of Gabor frames, the goal then is to design unimodal seed 
vectors that yield Gabor frames with the smallest-possible worst-case coherence. This, however, is an active area of 
mathematical research and a number of researchers have looked at this problem in recent years; see, e.g., HI. As 
such, we can simply leverage some of the existing research in this area in order to provide explicit constructions 
of Gabor frames that satisfy the (strong) coherence property with nearly-optimal worst-case coherence. 
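Specifically, let $n \ge 5$ be a prime number and construct a unimodal seed vector $g \in \mathbb{C}^n$ as follows

$$g = \frac{1}{\sqrt{n}} \begin{bmatrix} e^{j2\pi\frac{0^3}{n}} & e^{j2\pi\frac{1^3}{n}} & \cdots & e^{j2\pi\frac{(n-1)^3}{n}} \end{bmatrix}^T. \tag{16}$$

The sequence $\left\{ \frac{1}{\sqrt{n}} e^{j2\pi\frac{q^3}{n}} \right\}_{q=0}^{n-1}$ is termed the Alltop sequence in the literature. This sequence has the property
that its autocorrelation decays very fast and, therefore, it is particularly well-suited for generating Gabor frames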
with small worst-case coherence. In particular, it was established recently in [8] that Gabor frames generated from
the Alltop seed vector $g$ given in (16) satisfy

$$\mu(X) \le \frac{1}{\sqrt{n}}. \tag{17}$$

In addition, since we have that $g_{\min} = g_{\max} = 1/\sqrt{n}$ in this case, it follows from
Theorem 7 that the average coherence of Alltop Gabor frames satisfies $\nu(X) \le (n + 1)^{-1} \le \mu(X)/\sqrt{n}$. An
immediate consequence of this discussion is that all the results reported in Section II and Section III in the context
of model selection and sparse-signal recovery using OST apply directly to the case of Alltop Gabor frames. In
particular, it follows from Theorem 6 that Alltop Gabor frames together with OST are guaranteed to recover most
$k$-sparse signals, regardless of the statistical dependence across the nonzero entries of $\beta$, as long as $k \lesssim n$ and
$\mathrm{MAR} = \Theta(1)$. In contrast, the only other results available in the sparse-signal recovery literature for Alltop Gabor
frames are based on the higher-complexity basis pursuit [13] and require the nonzero entries of $\beta$ to be independent
and statistically symmetric around zero for the case when $\sqrt{n} \lesssim k$ [37], [38].
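
The coherence claims for Alltop Gabor frames can be verified numerically for a small prime. The sketch below is illustrative only: the helper names (`alltop_seed`, `gabor_frame`, `coherences`) and the choice $n = 7$ are assumptions, and the average coherence is computed as $\nu(X) = \frac{1}{p-1}\max_i |\sum_{j \neq i} \langle x_i, x_j \rangle|$ with $p = n^2$.

```python
import numpy as np

def alltop_seed(n):
    """Unit-norm cubic-phase (Alltop) seed vector for a prime n >= 5."""
    q = np.arange(n)
    return np.exp(2j * np.pi * ((q ** 3) % n) / n) / np.sqrt(n)

def gabor_frame(g):
    """n x n^2 matrix of all circular shifts and modulations of g."""
    n = g.size
    q = np.arange(n)
    T = g[(q[:, None] - q[None, :]) % n]
    return np.hstack([np.exp(2j * np.pi * m * q / n)[:, None] * T
                      for m in range(n)])

def coherences(X):
    """Worst-case coherence mu and average coherence nu of the columns of X."""
    G = X.conj().T @ X
    off = G - np.diag(np.diag(G))
    mu = np.abs(off).max()
    nu = np.abs(off.sum(axis=1)).max() / (G.shape[0] - 1)
    return mu, nu

n = 7                                   # assumed small prime for the example
X = gabor_frame(alltop_seed(n))
mu, nu = coherences(X)
```

For this seed the worst-case coherence meets $1/\sqrt{n}$ and the average coherence stays below $1/(n+1)$, which is consistent with the bounds quoted above.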



V. Proofs of Main Results

In this section, we provide detailed proofs of the main results reported in Section II and Section III. Before
proceeding further, however, it is advantageous to develop some notation that will facilitate our forthcoming analysis.
In this regard, recall that the true model $S$ is taken to be a uniformly random $k$-subset of $[p] \doteq \{1, \dots, p\}$. We
can therefore write the data vector $\beta$ under this assumption as a concatenation of a random permutation matrix and
a deterministic $k$-sparse vector. Specifically, let $\bar{z} \in \mathbb{C}^p$ be a deterministic $k$-sparse vector that we write (without
loss of generality) as

$$\bar{z} \doteq \big[ \underbrace{z_1 \;\cdots\; z_k}_{\doteq\, z \,\in\, \mathbb{C}^k} \;\; \underbrace{0 \;\cdots\; 0}_{(p-k) \text{ times}} \big]^T \tag{18}$$

and let $P_{\bar{\Pi}}$ be a $p \times p$ random permutation matrix; in other words,

$$P_{\bar{\Pi}} \doteq \begin{bmatrix} e_{\pi_1} & e_{\pi_2} & \cdots & e_{\pi_p} \end{bmatrix} \tag{19}$$

where $e_j$ denotes the $j$-th column of the canonical basis $I$ and $\bar{\Pi} \doteq (\pi_1, \dots, \pi_p)$ is a random permutation of $[p]$.
Then the assumption that the model $S$ is a random subset of $[p]$ is equivalent to stating that the data vector $\beta$ can
be written as $\beta = P_{\bar{\Pi}}\bar{z}$. In other words, the measurement vector $y$ can be expressed as

$$y = X\beta + \eta = X P_{\bar{\Pi}}\bar{z} + \eta = X_{\Pi} z + \eta \tag{20}$$

where $\Pi \doteq (\pi_1, \dots, \pi_k)$ denotes the first $k$ elements of the random permutation $\bar{\Pi}$, $X_{\Pi}$ denotes the $n \times k$ submatrix
obtained by collecting the columns of $X$ corresponding to the indices in $\Pi$, and the vector $z \in \mathbb{C}^k$ represents the
$k$ nonzero entries of $\beta$.

1) Proof of Theorem 1: The general road map for the proof of Theorem 1 is as follows. Below, we first introduce
the notion of the $(k, \epsilon, \delta)$-statistical orthogonality condition (StOC). We next establish the relationship between the StOC
parameters and the worst-case and average coherence of $X$ in Lemma 3 and Lemma 4. We then provide a proof
of Theorem 1 by first showing that if $X$ satisfies the StOC then OST recovers $S$ with high probability and then
relating the results of Lemma 3 and Lemma 4 to the coherence property.

Definition 3 ($(k, \epsilon, \delta)$-Statistical Orthogonality Condition). Let $\bar{\Pi} = (\pi_1, \dots, \pi_p)$ be a random permutation of $[p]$,
and define $\Pi \doteq (\pi_1, \dots, \pi_k)$ and $\Pi^c \doteq (\pi_{k+1}, \dots, \pi_p)$ for any $k \in [p]$. Then the $n \times p$ (normalized) design matrix
$X$ is said to satisfy the $(k, \epsilon, \delta)$-statistical orthogonality condition if there exist $\epsilon, \delta \in [0, 1)$ such that the inequalities

$$\|(X_{\Pi}^H X_{\Pi} - I) z\|_\infty \le \epsilon \|z\|_2 \tag{StOC-1}$$
$$\|X_{\Pi^c}^H X_{\Pi} z\|_\infty \le \epsilon \|z\|_2 \tag{StOC-2}$$

hold for every fixed $z \in \mathbb{C}^k$ with probability exceeding $1 - \delta$ (with respect to the random permutation $\bar{\Pi}$).



Remark 3. Note that the StOC derives its name from the fact that if $X$ is a $p \times p$ orthonormal matrix then it trivially
satisfies the StOC for every $k \in [p]$ with $\epsilon = \delta = 0$. In addition, although we will not use this fact explicitly in the
paper, it can be checked that if $X$ satisfies the $(k, \epsilon, \delta)$-StOC then it approximately preserves the $\ell_2$-norms of $k$-sparse
signals with probability exceeding $1 - \delta$ as long as $k < \epsilon^{-2}$.
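
The two StOC inequalities lend themselves to a simple Monte Carlo check. The sketch below is only illustrative: the function name `stoc_failure_rate`, the number of trials, and the toy matrices are assumptions. It draws random permutations, evaluates both (StOC-1) and (StOC-2) for a fixed $z$, and reports the empirical fraction of draws that violate either one, which serves as a rough estimate of $\delta$.

```python
import numpy as np

def stoc_failure_rate(X, z, eps, trials=200, seed=0):
    """Empirical probability that (StOC-1) or (StOC-2) fails for a fixed z."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    k = z.size
    G = X.conj().T @ X                                 # Gram matrix X^H X
    bound = eps * np.linalg.norm(z)
    fails = 0
    for _ in range(trials):
        perm = rng.permutation(p)
        Pi, Pi_c = perm[:k], perm[k:]
        v1 = (G[np.ix_(Pi, Pi)] - np.eye(k)) @ z       # left side of (StOC-1)
        v2 = G[np.ix_(Pi_c, Pi)] @ z                   # left side of (StOC-2)
        worst = max(np.abs(v1).max(initial=0.0), np.abs(v2).max(initial=0.0))
        fails += worst > bound
    return fails / trials

# Orthonormal design: both conditions hold with eps = 0 for every permutation.
z = np.array([1.0, -2.0, 0.5, 3.0])
rate_orthonormal = stoc_failure_rate(np.eye(16), z, eps=0.0)

# Degenerate design with identical columns: the conditions always fail.
X_bad = np.zeros((4, 6))
X_bad[0, :] = 1.0
rate_degenerate = stoc_failure_rate(X_bad, np.array([1.0, 1.0]), eps=0.5)
```

The orthonormal case reproduces the observation in Remark 3, while the degenerate case shows how highly coherent columns break the condition.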

Having defined StOC, our goal in the next two lemmas is to relate the StOC parameters $k$, $\epsilon$, and $\delta$ to the
worst-case and average coherence of the design matrix $X$.

Lemma 3. Let $\Pi = (\pi_1, \dots, \pi_k)$ denote the first $k$ elements of a random permutation of $[p]$ and choose a parameter
$a \ge 1$. Then, for any fixed $z \in \mathbb{C}^k$, $\epsilon \in [0, 1)$, and $k \le \min\{\epsilon^2 \nu^{-2}, (1 + a)^{-1} p\}$, we have

$$\Pr\big(\{X \text{ does not satisfy (StOC-1)}\}\big) \le 4k \exp\left( -\frac{(\epsilon - \sqrt{k}\,\nu)^2}{16(2 + a^{-1})^2 \mu^2} \right). \tag{21}$$

Proof: The proof of this lemma relies heavily on the so-called method of bounded differences (MOBD) ISTl 



Specifically, we begin by noting that ||(X^Xn — I)z^ = max 



Therefore for a fixed index i. 



and conditioned on the event Ai' = {tt; = «'}, we have the following equality from basic probability theory 



Pr ( I ^ (x^, , Xtt^ ) I > e||z||2 A' ] = Pi' ( | Zj{ 



Ai 



(22) 



Next, in order to apply the MOBD to obtain an upper bound for ( |22] |. we first define a random (fc — l)-tuple 
n^' = (tti, . . . , 7r.i_i, TTi+i, . . . , TTfc) and then construct a Doob martingale {Mq, Mi, . . . , Mk~i) as follows: 



Mo = E 1^ ^ Zj {Ki> , x^^. ) Ai' and = E |^ ^ zj (xj/ , x^^ ) , Ai' 



i=i 
j'A 



j=i 



, i,...,fc-i 



(23) 



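where $\pi^{-i}_{1 \to \ell}$ denotes the first $\ell$ elements of $\Pi^{-i}$. The first thing to note here is that we have from the linearity of
(conditional) expectation

$$|M_0| = \Big| \sum_{j \neq i} z_j\, \mathbb{E}\big[ \langle x_{i'}, x_{\pi_j} \rangle \,\big|\, A_{i'} \big] \Big| \overset{(a)}{=} \Big| \sum_{j \neq i} z_j\, \frac{1}{p-1} \sum_{q=1,\, q \neq i'}^{p} \langle x_{i'}, x_q \rangle \Big| \overset{(b)}{\le} \nu \|z\|_1 \le \sqrt{k}\,\nu \|z\|_2 \tag{24}$$

where (a) follows from the fact that, conditioned on $A_{i'}$, $\pi_j$ has a uniform distribution over $[p] - \{i'\}$, while (b) is
mainly a consequence of the definition of average coherence. In addition, if we use $\pi^{-i}_{\ell}$ to denote the $\ell$-th element
of $\Pi^{-i}$ and define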



M, 



j=i 



l,...,fc-l 



(25) 



then, since (Aig, A/i, . . . , Mfe_i) is a Doob martingale, it can be easily verified that \Mi — Mi_i \ is upper bounded 
by sup,_, [A-h{r) - M,(s)] (see, e.g., |5l). 

Now in order to obtain an upper bound for sup^ ^ \M^{r) — (s)] , notice that 



Mi{r) - Mi{s)\ = 



(e 


(Xi' , X,r^. ) 


""l-i-f-l' ~ ^' -^i' 


-E 






) 


E 


(Xj' , X^^. ) 




-E 


(Xj' , X^Tj ) 



















— dp 



E I^jIH^.jI + E I^jII'^' 



(26) 



In addition, we have that for every j > C. + l,j ^ i, the random variable ttj has a uniform distribution over 
IpI — {TT^\^^_-^,r,i'} when conditioned on {7r^_^^_j^, tt^' = r,i'}, whereas ttj has a uniform distribution over 
IpI — {tTi^^_i, s,i'} when conditioned on {T^i^g_i,T^J^ = s, *'}■ Therefore, we obtain 



\di,. 



p-l-1 



1 Xr ) y^i' 7 X^ 



< 



2^ 



1 



(27) 



Similarly, it can be shown that '^j<i+i \zj\\dt,j\ < |z£+i|2/i when i < i, '^j<e+i \zj\\dij\ < \zi\^2^ when 
i = ^ + 1, and X]j< | | I'^^.i | — (kf I + ~^ri )2/Lt when i > £ + 1. Consequently, regardless of the initial choice 
of i, we conclude that 



sup [Mi{r) - Mi{s)\ < 2^( \ze\ + \ze+i\ + J2 l^jl 



(28) 



We have now established that $(M_0, M_1, \dots, M_{k-1})$ is a (real- or complex-valued) bounded-difference martingale
sequence with $|M_\ell - M_{\ell-1}| \le 2\mu d_\ell$ for $\ell = 1, \dots, k-1$. Therefore, under the assumption that $k \le \epsilon^2 \nu^{-2}$ and
since it has been established in (24) that $|M_0| \le \sqrt{k}\,\nu \|z\|_2$, it is easy to see that



i=i 



A' I <Pr( |A4_i-A/o| >e|kll2-\/fct^ 11^112 
{e~Vk,ynz\\l\ 



A, 



< 4exp - 



(29)

where (c) follows from the complex Azuma inequality for bounded-difference martingale sequences (see Lemma 5
in Appendix A). Further, it can be established through routine calculations from (28) that $\sum_{\ell=1}^{k-1} d_\ell^2 \le (2 + a^{-1})^2 \|z\|_2^2$
since $k \le p/(1 + a)$. Combining all these facts together, we finally obtain



[d) 



Pr ||(XgXn-/)z||^ >e||z||2 < fc Pr | ^ z,(x,,,x,^.)| > e||z||2 



< 4fcexp 



(30) 



, 16(2 + a-i)2^2 I 

where (d) follows from the union bound and the fact that the $\pi_i$'s are identically (though not independently)
distributed, while (e) follows from (29) and the fact that $\pi_i$ has a uniform distribution over $[p]$. ■

Lemma 4. Let $\Pi = (\pi_1, \dots, \pi_k)$ and $\Pi^c = (\pi_{k+1}, \dots, \pi_p)$ denote the first $k$ and the last $(p - k)$ elements of a
random permutation of $[p]$, respectively, and choose a parameter $a \ge 1$. Then, for any fixed $z \in \mathbb{C}^k$, $\epsilon \in [0, 1)$,
and $k \le \min\{\epsilon^2 \nu^{-2}, (1 + a)^{-1} p\}$, we have

{e-Vkvf 



Pr ( {X does not satisfy ( IStOC-2l l} ) < 4(p - k) cxp 



(l + a-i)V' 



(31) 



Proof: The proof of this lemma is very similar to that of Lemma 3 and also relies on the MOBD. To begin
with, we note that $\|X_{\Pi^c}^H X_\Pi z\|_\infty = \max_{i \in [p-k]} \bigl|\sum_{j=1}^{k} z_j\langle x_{\pi^c_i}, x_{\pi_j}\rangle\bigr|$, where $[p - k] = \{1, \dots, p - k\}$ and $\pi^c_i$ denotes the
$i$-th element of $\Pi^c$. Then for a fixed index $i \in [p - k]$, and conditioned on the event $A_{i'} = \{\pi^c_i = i'\}$, we again
have the following equality



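$$ \Pr\Bigl(\Bigl|\sum_{j=1}^{k} z_j\langle x_{\pi^c_i}, x_{\pi_j}\rangle\Bigr| > \epsilon\|z\|_2 \,\Big|\, A_{i'}\Bigr) = \Pr\Bigl(\Bigl|\sum_{j=1}^{k} z_j\langle x_{i'}, x_{\pi_j}\rangle\Bigr| > \epsilon\|z\|_2 \,\Big|\, A_{i'}\Bigr). \tag{32} $$

Next, as in the case of Lemma 3, we construct a Doob martingale sequence $(M_0, M_1, \dots, M_k)$ as follows: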



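$$ M_0 = \mathbb{E}\Bigl[\sum_{j=1}^{k} z_j\langle x_{i'}, x_{\pi_j}\rangle \,\Big|\, A_{i'}\Bigr] \quad\text{and}\quad M_\ell = \mathbb{E}\Bigl[\sum_{j=1}^{k} z_j\langle x_{i'}, x_{\pi_j}\rangle \,\Big|\, \pi_{1\to\ell}, A_{i'}\Bigr], \quad \ell = 1, \dots, k, \tag{33} $$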

where $\pi_{1\to\ell}$ now denotes the first $\ell$ elements of $\Pi$. Then, since $\pi_j$ has a uniform distribution over $[p] - \{i'\}$
when conditioned on $A_{i'}$, we once again have the bound $|M_0| \le \sqrt{k}\,\nu\|z\|_2$. Therefore, the only remaining thing
that we need to show in order to be able to apply the complex Azuma inequality to the constructed martingale
$(M_0, M_1, \dots, M_k)$ is that $|M_\ell - M_{\ell-1}|$ is suitably bounded.



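In this regard, we once again define $M_\ell(r) = \mathbb{E}\bigl[\sum_{j=1}^{k} z_j\langle x_{i'}, x_{\pi_j}\rangle \,\big|\, \pi_{1\to\ell-1}, \pi_\ell = r, A_{i'}\bigr]$ and note that

$$ |M_\ell(r) - M_\ell(s)| \le \sum_{j \ge \ell} |z_j|\,\Bigl|\mathbb{E}\bigl[\langle x_{i'}, x_{\pi_j}\rangle \,\big|\, \pi_{1\to\ell-1}, \pi_\ell = r, A_{i'}\bigr] - \mathbb{E}\bigl[\langle x_{i'}, x_{\pi_j}\rangle \,\big|\, \pi_{1\to\ell-1}, \pi_\ell = s, A_{i'}\bigr]\Bigr| \le 2\mu\Bigl(|z_\ell| + \frac{1}{p-\ell-1}\sum_{j>\ell}|z_j|\Bigr) =: 2\mu d_\ell, \tag{34} $$

which implies that $\sup_{r,s} |M_\ell(r) - M_\ell(s)| \le 2\mu d_\ell$, $\ell = 1, \dots, k$. Consequently, we have now established that
$(M_0, M_1, \dots, M_k)$ is a bounded-difference martingale with $|M_\ell - M_{\ell-1}| \le 2\mu d_\ell$. Therefore, since $k \le \epsilon^2\nu^{-2}$ and
$|M_0| \le \sqrt{k}\,\nu\|z\|_2$, we once again have from the complex Azuma inequality that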

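$$ \Pr\Bigl(\Bigl|\sum_{j=1}^{k} z_j\langle x_{i'}, x_{\pi_j}\rangle\Bigr| > \epsilon\|z\|_2 \,\Big|\, A_{i'}\Bigr) \le \Pr\bigl(|M_k - M_0| > \epsilon\|z\|_2 - \sqrt{k}\,\nu\|z\|_2\bigr) \overset{(a)}{\le} 4\exp\Bigl(-\frac{(\epsilon-\sqrt{k}\,\nu)^2}{8(1+a^{-1})^2\mu^2}\Bigr), \tag{35} $$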



where (a) follows from the fact that $\sum_{\ell=1}^{k} d_\ell^2 \le (1 + a^{-1})^2\|z\|_2^2$ since $k \le p/(1 + a)$. Combining all these facts
together, we finally obtain the claimed result as follows

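$$ \Pr\bigl(\|X_{\Pi^c}^H X_\Pi z\|_\infty > \epsilon\|z\|_2\bigr) \overset{(b)}{\le} (p-k)\,\Pr\Bigl(\Bigl|\sum_{j=1}^{k} z_j\langle x_{\pi^c_i}, x_{\pi_j}\rangle\Bigr| > \epsilon\|z\|_2\Bigr) \le (p-k)\sum_{i'=1}^{p}\Pr\Bigl(\Bigl|\sum_{j=1}^{k} z_j\langle x_{i'}, x_{\pi_j}\rangle\Bigr| > \epsilon\|z\|_2 \,\Big|\, A_{i'}\Bigr)\Pr(A_{i'}) \overset{(c)}{\le} 4(p-k)\exp\Bigl(-\frac{(\epsilon-\sqrt{k}\,\nu)^2}{8(1+a^{-1})^2\mu^2}\Bigr), \tag{36} $$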

where (b) follows from the union bound and the fact that the $\pi^c_i$'s are identically (though not independently)
distributed, while (c) follows from (35) and the fact that $\pi^c_i$ has a uniform distribution over $[p]$. ■
Note that Lemma 3 and Lemma 4 collectively prove through a simple union bound argument that an $n \times p$ design
matrix $X$ satisfies $(k, \epsilon, \delta)$-StOC for any $\epsilon \in [0, 1)$ with $\delta \le 4p\exp\bigl(-\frac{(\epsilon-\sqrt{k}\,\nu)^2}{16(2+a^{-1})^2\mu^2}\bigr)$ and any $a \ge 1$ as long as
$k \le \min\{\epsilon^2\nu^{-2}, (1 + a)^{-1}p\}$. We are now ready to provide a proof of Theorem 1.

Proof (Theorem 1): We begin by making use of the notation developed at the start of this section and writing
the signal proxy $f = X^H y$ as $f = X^H X_\Pi z + X^H \eta$. Now, let $\Pi^c = (\pi_{k+1}, \dots, \pi_p)$ denote the last $(p - k)$ elements
of $\bar\Pi$ and note that we need to show that $\|f_{\Pi^c}\|_\infty \le \lambda$ and $\min_{i \in \{1,\dots,k\}} |f_{\pi_i}| > \lambda$ in order to establish that $\hat{S} = S$.

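In this regard, we first assume that $X$ satisfies $(k, \epsilon, \delta)$-StOC and define $\lambda_\epsilon = \max\bigl\{\tfrac{1}{t}\,\epsilon\|z\|_2, \tfrac{1}{1-t}\,2\sqrt{\sigma^2\log p}\bigr\}$ for
any $t \in (0, 1)$. Next, it can be verified through Lemma 6 in Appendix A that $\tilde\eta = X^H\eta$ satisfies $\|\tilde\eta\|_\infty \le 2\sqrt{\sigma^2\log p}$
with probability exceeding $1 - 2(\sqrt{2\pi\log p}\cdot p)^{-1}$. Now define the event

$$ \mathcal{G} = \bigl\{X \text{ satisfies (StOC-1) and (StOC-2)}\bigr\} \cap \bigl\{\|\tilde\eta\|_\infty \le 2\sqrt{\sigma^2\log p}\bigr\} \tag{37} $$

and notice that we trivially have $\Pr(\mathcal{G}) > 1 - \delta - 2(\sqrt{2\pi\log p}\cdot p)^{-1}$. Further, conditioned on the event $\mathcal{G}$, we have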



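$$ \|f_{\Pi^c}\|_\infty \overset{(a)}{\le} \|X_{\Pi^c}^H X_\Pi z\|_\infty + \|X_{\Pi^c}^H \eta\|_\infty \overset{(b)}{\le} \epsilon\|z\|_2 + 2\sqrt{\sigma^2\log p} \overset{(c)}{\le} \lambda_\epsilon, \tag{38} $$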
where (a) follows from the triangle inequality, (b) is mainly a consequence of the conditioning on the event $\mathcal{G}$,
and (c) follows from the definition of $\lambda_\epsilon$. Next, we define $r = (X_\Pi^H X_\Pi - I)z$ and notice that, conditioned on the
event $\mathcal{G}$, we have for any $i \in [k] = \{1, \dots, k\}$ the following inequality:

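$$ |f_{\pi_i}| = |z_i + r_i + \tilde\eta_{\pi_i}| \ge |z_i| - \|r\|_\infty - \|\tilde\eta\|_\infty \overset{(d)}{\ge} \beta_{\min} - \epsilon\|z\|_2 - 2\sqrt{\sigma^2\log p} \overset{(e)}{\ge} \beta_{\min} - \lambda_\epsilon. \tag{39} $$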

Here, (d) follows from the conditioning on $\mathcal{G}$, while (e) is a simple consequence of the choice of $\lambda_\epsilon$. It can therefore
be concluded from (38) and (39) that if $X$ satisfies $(k, \epsilon, \delta)$-StOC and the OST algorithm uses the threshold $\lambda_\epsilon$
then $\Pr(\hat{S} \ne S) \le \Pr(\mathcal{G}^c)$ as long as $\beta_{\min} > 2\lambda_\epsilon$.



Finally, to complete the proof of this theorem, let $k \le n/(2\log p)$ and fix $\epsilon = 10\mu\sqrt{2\log p}$. Then the claim
is that $X$ satisfies $(k, \epsilon, \delta)$-StOC with $\delta \le 4p^{-1}$. In order to establish this claim, we only need to ensure that
the chosen parameters satisfy the assumptions of Lemma 3 and Lemma 4. In this regard, note that (i) $\epsilon < 1$
because of (CP-1), and (ii) $\sqrt{k}\,\nu \le \epsilon$ because of the assumption that $k \le n/(2\log p)$ and (CP-2). Therefore,
since the assumption $p \ge 128$ together with $k \le n/(2\log p)$ implies that $16(2 + a^{-1})^2 \le 72$, we obtain
$\exp\bigl(-\frac{(\epsilon-\sqrt{k}\,\nu)^2}{16(2+a^{-1})^2\mu^2}\bigr) \le p^{-2}$. We can now combine this fact with the previously established facts to see that
the threshold $\lambda = \max\bigl\{\tfrac{1}{t}\,10\mu\sqrt{n\cdot\mathrm{SNR}}, \tfrac{1}{1-t}\sqrt{2}\bigr\}\sqrt{2\sigma^2\log p}$ guarantees that $\Pr(\hat{S} \ne S) \le 6p^{-1}$ as long as
$n \ge 2k\log p$ and $\beta_{\min} > 2\lambda$. Finally, note that

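$$ \beta_{\min} > \frac{4}{1-t}\sqrt{\sigma^2\log p} \iff n > \frac{8(1-t)^{-2}}{\mathrm{SNR}_{\min}}\,(2k\log p) $$

and

$$ \beta_{\min} > \frac{20\mu}{t}\sqrt{2n\sigma^2\log p\cdot\mathrm{SNR}} \iff n > \frac{400\,t^{-2}\,n\mu^2}{\mathrm{MAR}}\,(2k\log p). $$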

This completes the proof of the theorem. ■ 
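The threshold appearing at the end of the proof can be evaluated numerically. The following is a minimal sketch, assuming that SNR is given on a linear scale (not in dB) and that the coherence $\mu$ has been computed elsewhere; the function names `ost_threshold` and `exact_selection_conditions` are hypothetical and are not taken from the paper.

```python
import math

def ost_threshold(mu, n, p, snr, sigma2, t=0.5):
    """Model-order agnostic threshold, as read from the proof of Theorem 1:
    lambda = max{(1/t) 10 mu sqrt(n SNR), (1/(1-t)) sqrt(2)} * sqrt(2 sigma^2 log p).
    Assumption: `snr` is on a linear scale, and t lies in (0, 1)."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    coherence_term = (1.0 / t) * 10.0 * mu * math.sqrt(n * snr)
    noise_term = (1.0 / (1.0 - t)) * math.sqrt(2.0)
    return max(coherence_term, noise_term) * math.sqrt(2.0 * sigma2 * math.log(p))

def exact_selection_conditions(k, n, p, beta_min, lam):
    """Sufficient conditions used in the proof: n >= 2 k log p and beta_min > 2 lambda."""
    return n >= 2.0 * k * math.log(p) and beta_min > 2.0 * lam

# Hypothetical numbers, chosen only to exercise the two branches of the max.
lam_noise_limited = ost_threshold(mu=0.0, n=100, p=math.exp(4), snr=1.0, sigma2=1.0)
lam_coherence_limited = ost_threshold(mu=0.1, n=100, p=math.exp(4), snr=1.0, sigma2=1.0)
```

When $\mu\sqrt{n\cdot\mathrm{SNR}}$ is small the second term of the maximum dominates and the threshold reduces to a multiple of $\sqrt{\sigma^2\log p}$, which mirrors the first of the two equivalences above.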

2) Proof of Theorem^ We begin by making use of the notation developed earlier in this section and conditioning 
on the event $\mathcal{G}$ defined in (37) with $\epsilon = 10\mu\sqrt{2\log p}$. Then it is easy to see from the proof of Theorem 1 that the
estimate $\hat{S}$ is a subset of $S$ because of the fact that $\|f_{\Pi^c}\|_\infty \le \lambda$.

Next, assume without loss of generality that $|z_i| = \beta_{(i)}$ and note from (39) that $|f_{\pi_i}| \ge \beta_{(i)} - \lambda$ for any
$i \in \{1, \dots, k\}$. Then, since $\pi_i \in \hat{S}$ if and only if $|f_{\pi_i}| > \lambda$, we have that $\beta_{(i)} > 2\lambda \Rightarrow \pi_i \in \hat{S}$. Now define $M$ to
be the largest integer for which $\beta_{(M)} > 2\lambda$ holds and note that $\beta_{(i)} \ge \beta_{(M)} > 2\lambda \Rightarrow \pi_i \in \hat{S}$ for every
$i \in \{1, \dots, M\}$, which in turn implies that $|S - \hat{S}| \le (k - M)$. Finally, note that

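$$ \beta_{(M)} > \frac{4}{1-t}\sqrt{\sigma^2\log p} \iff \mathrm{LAR}_M > 8(1-t)^{-2}\Bigl(\frac{2k\log p}{n\cdot\mathrm{SNR}}\Bigr) $$

and

$$ \beta_{(M)} > \frac{20\mu}{t}\sqrt{2n\sigma^2\log p\cdot\mathrm{SNR}} \iff \mathrm{LAR}_M > 400\,t^{-2}\mu^2\,(2k\log p). $$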



This completes the proof of the theorem since the event $\mathcal{G}$ holds with probability exceeding $1 - 6p^{-1}$.
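The argument above can be illustrated with a small sketch of the thresholding step itself, assuming the OST estimate is $\hat{S} = \{i : |f_i| > \lambda\}$ with signal proxy $f = X^H y$, as used in the proofs. The function name `one_step_thresholding` and the toy data are hypothetical and serve only as an illustration.

```python
def one_step_thresholding(X, y, lam):
    """Return {i : |f_i| > lam}, where f = X^H y is the signal proxy.
    X is a list of n rows with p (real or complex) entries each; y has length n."""
    n, p = len(X), len(X[0])
    proxy = [sum(complex(X[r][i]).conjugate() * y[r] for r in range(n))
             for i in range(p)]
    return {i for i in range(p) if abs(proxy[i]) > lam}

# Toy design with orthonormal columns (hypothetical data; its coherence is zero).
X = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
beta = [3.0, 0.0, 0.0, 0.5]        # true model S = {0, 3}
noise = [0.05, -0.1, 0.08, 0.02]   # fixed perturbation standing in for the noise
y = [sum(X[r][i] * beta[i] for i in range(4)) + noise[r] for r in range(4)]

estimate = one_step_thresholding(X, y, lam=1.0)
```

In this toy instance only the entry whose magnitude exceeds $2\lambda$ is detected, and the estimate contains no index outside the true model, which is the behavior that partial model selection describes.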
3) Proof of Theorem 6: The first key result that we will need to prove Theorem 6 concerns the expected
spectral norm of a random principal submatrix of $(X^H X - I)$. The following result is mainly due to Tropp [6], and
it was first presented in the following form by Candès and Plan in [20].



Proposition 2 ([6], [20]). Let $\bar\Pi = (\pi_1, \dots, \pi_p)$ be a random permutation of $[p]$ and define $\Pi = (\pi_1, \dots, \pi_k)$ for
any $k \in [p]$. Then, for $q = 2\log p$, we have

$$ \bigl(\mathbb{E}\bigl[\|X_\Pi^H X_\Pi - I\|_2^q\bigr]\bigr)^{1/q} \le 2^{1/q}\Bigl(30\mu\log p + 13\sqrt{\frac{2k\|X\|_2^2\log p}{p}}\Bigr) \tag{40} $$

provided that $k \le p/(4\|X\|_2^2)$. Here, the expectation is with respect to the random permutation $\bar\Pi$.

Using this result, it is easy to obtain a probabilistic bound (with respect to the random permutation $\bar\Pi$) on the
minimum and maximum singular values of a random submatrix of $X$ since, by Markov's inequality, we have
that $\Pr\bigl(\|X_\Pi^H X_\Pi - I\|_2 \ge u\bigr) \le u^{-q}\,\mathbb{E}\bigl[\|X_\Pi^H X_\Pi - I\|_2^q\bigr]$ for any $u > 0$. The following result is simply a generalization of the
corresponding result presented in [20].



Proposition 3 (Extreme Singular Values of a Random Submatrix). Let $\Pi = (\pi_1, \dots, \pi_k)$ denote the first $k$ elements
of a random permutation of $[p]$ and suppose that $\mu(X) \le (c_1'\log p)^{-1}$ and $k \le p/((c_2')^2\|X\|_2^2\log p)$ for numerical
constants $c_1' = 60e$ and $c_2' = 37e$. Then we have that

$$ \Pr\bigl(\|X_\Pi^H X_\Pi - I\|_2 \ge e^{-1/2}\bigr) \le 2p^{-1}. \tag{41} $$
Note that Proposition 3 guarantees that, under certain conditions on $\mu(X)$ and $k$, every singular value of most $n \times k$
submatrices of $X$ lies within $\bigl(\sqrt{1 - e^{-1/2}}, \sqrt{1 + e^{-1/2}}\bigr)$. We are now ready to provide a proof of Theorem 6 that
relies on this key result as well as on Lemma 3 and Lemma 4.
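The way the constants of Proposition 3 fit together with the moment bound of Proposition 2 and Markov's inequality can be checked numerically. The sketch below is one reading of that step, not the authors' derivation; the helper names and the value chosen for $\|X\|_2^2$ are hypothetical.

```python
import math

def expected_norm_bound(mu, k, p, spectral_norm_sq):
    """Moment bound of Proposition 2 without the 2^(1/q) factor:
    30 mu log p + 13 sqrt(2 k ||X||_2^2 log p / p)."""
    log_p = math.log(p)
    return 30.0 * mu * log_p + 13.0 * math.sqrt(2.0 * k * spectral_norm_sq * log_p / p)

def markov_tail(p, moment_root, threshold):
    """Markov bound Pr(norm >= threshold) <= 2 (moment_root / threshold)^q with q = 2 log p."""
    q = 2.0 * math.log(p)
    return 2.0 * (moment_root / threshold) ** q

p = 1024
spectral_norm_sq = 4.0                                  # hypothetical value of ||X||_2^2
c1, c2 = 60.0 * math.e, 37.0 * math.e                   # constants of Proposition 3
mu = 1.0 / (c1 * math.log(p))                           # largest coherence allowed
k = p / (c2 ** 2 * spectral_norm_sq * math.log(p))      # largest k allowed (kept real-valued)
root = expected_norm_bound(mu, k, p, spectral_norm_sq)  # stays below 1/e
tail = markov_tail(p, root, math.exp(-0.5))             # stays below 2/p
```

At the extreme admissible values of $\mu$ and $k$, the two terms of the moment bound contribute roughly $0.5/e$ each, so the threshold $e^{-1/2}$ with $q = 2\log p$ yields a tail of at most $2p^{-1}$.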

Proof (Theorem 6): The proof of this theorem follows along somewhat similar lines as the proof of Theorem 1.
Specifically, by making use of the notation developed at the start of this section, we write $f = X^H y = X^H X_\Pi z$
and first argue that the set of indices $\mathcal{I} := \{i \in [p] : |f_i| > \lambda\}$ is the same as the true model $S$ with high probability.
Then we make use of the union bound and argue using Proposition 3 that $\hat\beta = \beta$ with high probability.

In this regard, recall that it was established in the proof of Theorem 1 using Lemma 3 and Lemma 4 that if
$X$ obeys the coherence property then it satisfies $(k, \epsilon, \delta)$-StOC with $\epsilon = 10\mu\sqrt{2\log p}$ and $\delta \le 4p^{-1}$ as long as
$k \le n/(2\log p)$. This fact therefore implies that, under the assumptions of the theorem,¹ the following inequalities
hold with probability exceeding $1 - 4p^{-1}$:

$$ \|f_{\Pi^c}\|_\infty = \|X_{\Pi^c}^H X_\Pi z\|_\infty \le \epsilon\|z\|_2, \quad\text{and} \tag{42} $$

$$ \min_{i \in \{1,\dots,k\}} |f_{\pi_i}| \ge \beta_{\min} - \|(X_\Pi^H X_\Pi - I)z\|_\infty \ge \beta_{\min} - \epsilon\|z\|_2. \tag{43} $$

¹Note that the assumptions of the theorem trivially guarantee the condition $k \le n/(2\log p)$ since we have that $\|X\|_2^2 \ge p/n$ from elementary
linear algebra.
Also note that, conditioned on the probability event $\mathcal{E} = \{\|X_\Pi^H X_\Pi - I\|_2 < e^{-1/2}\}$, we can write

$$ \sqrt{1 - e^{-1/2}}\,\|z\|_2 \le \|X_\Pi z\|_2 = \|y\|_2 \le \sqrt{1 + e^{-1/2}}\,\|z\|_2. \tag{44} $$

Therefore if we condition on the event $\mathcal{E}$ then it trivially follows from the assumptions of the theorem and (42) and
(43) that $\mathcal{I} = S$ with probability exceeding $1 - 4p^{-1}$ since (i) $\mathcal{I} \subset S$ because $\|f_{\Pi^c}\|_\infty \le \lambda$ (cf. (42),
(44)), and (ii) $\mathcal{I} \supset S$ because $k \le \mu^{-2}\mathrm{MAR}/(c_4\log p)$ implies that $\beta_{\min} - \epsilon\|z\|_2 > \lambda \Rightarrow \min_{i \in \{1,\dots,k\}} |f_{\pi_i}| > \lambda$
(cf. (43), (44)). Consequently, we conclude that $(X_{\mathcal{I}})^\dagger = (X_\Pi^H X_\Pi)^{-1}X_\Pi^H$ with high probability when conditioned
on the probability event $\mathcal{E}$, which in turn implies that $\hat\beta_{\mathcal{I}} = (X_{\mathcal{I}})^\dagger X_\Pi z = \beta_S$ with probability exceeding $1 - 4p^{-1}$
when conditioned on $\mathcal{E}$. The claim of the theorem now follows trivially from the union bound and the fact that
$\Pr(\mathcal{E}^c) \le 2p^{-1}$ from Proposition 3 since $X$ satisfies the strong coherence property and $k \le p/(c_2^2\|X\|_2^2\log p)$. ■

VI. Conclusions 

In the modern statistics and signal processing literature, the lasso has arguably become the standard tool for
model selection because of its computational tractability [17] and some recent theoretical guarantees [2], [18]-[20].
Nevertheless, it is desirable to study alternative solutions to the lasso since (i) it is still computationally expensive
for massively large-scale inference problems (think of $p$ in the millions), (ii) it lacks theoretical guarantees beyond
$k \gtrsim \mu^{-1}$ for the case of generic design matrices and arbitrary nonzero entries, and (iii) it requires the submatrices
of the design matrix to have full rank, which seems reasonable for signal reconstruction but appears too restrictive
for model selection.

In this paper, we have revisited two variants of the oft-forgotten but extremely fast one-step thresholding (OST) 
algorithm for model selection. One of the key insights offered by the paper in this regard is that polynomial-time 
model selection can be carried out even when signal reconstruction (and thereby the lasso) fails. In addition, we 
have established in the paper that if the $n \times p$ design matrix $X$ satisfies $\mu(X) \asymp n^{-1/2}$ and $\nu(X) \lesssim n^{-1}$ then
OST can perform near-optimally for the case when either (i) the minimum-to-average ratio (MAR) of the signal is
not too small or (ii) the signal-to-noise ratio (SNR) in the measurement system is not too high. It is worth pointing 
out here that some researchers in the past have observed that the sorted variant of the OST (SOST) algorithm at 
times performs similar to or better than the lasso (see Fig. 2 for an illustration of this in the case of an Alltop
Gabor frame in $\mathbb{C}^{127}$). One of our main contributions in this regard is that we have taken the mystery out of this
observation and explicitly specified in the paper the four key parameters of the model-selection problem, namely,
$\mu(X)$, $\nu(X)$, MAR, and SNR, that determine the non-asymptotic performance of the SOST algorithm for generic
(random or deterministic) design matrices and data vectors having generic (random or deterministic) nonzero entries; 
also, see lITOl for a comparison of our results with corresponding results recently reported in the literature. 

The second main contribution of this paper — which completely sets it apart from existing work on thresholding 
for model selection — is that we have proposed and analyzed a model-order agnostic threshold for the OST algorithm. 
The significance of this aspect of the paper can be best understood by realizing that in real-world applications it 
is often easier to estimate the SNR and the noise variance in the system than to estimate the true model order. In 
Fig. 2. Numerical comparisons between the performance of the SOST algorithm (Algorithm 3) and the lasso [17] using an Alltop Gabor
frame. The $n \times p$ design matrix $X$ has dimensions $n = 127$ and $p = n^2$, the MAR of the signals is 1, the SNR in the measurement system
is 10 dB, and the noise variance is cr^ = 10~^. The matrix-vector multiplications are carried out using the fast Fourier transform, while the 
lasso is solved using the SpaRSA package [53] with the regularization parameter set to $\tau = 2\sqrt{2\sigma^2\log p}$ [20]. (a) Plots of the fraction of
detections, defined as $f_D = |\hat{S} \cap S|/|S|$, and the fraction of false alarms, defined as $f_{FA} = (|\hat{S}| - |\hat{S} \cap S|)/|\hat{S}|$, versus the model order (averaged over
200 independent trials) for both SOST and the lasso. (b) Plots of the amount of time (averaged over 200 independent trials) that it takes SOST
and the lasso to solve one model-selection problem versus the model order.



particular, we have established in the paper that the threshold $\lambda = \max\bigl\{\tfrac{1}{t}\,10\mu\sqrt{n\cdot\mathrm{SNR}}, \tfrac{1}{1-t}\sqrt{2}\bigr\}\sqrt{2\sigma^2\log p}$ for
$t \in (0, 1)$ enables the OST algorithm to carry out near-optimal partial model selection. It is worth pointing out
here that this threshold is rather conservative in nature for small-scale problems (see (5)) and we believe that there
is still a lot of room for improvement as far as reducing (or eliminating) some of the constants in the threshold
is concerned. In particular, it is easy to see from the proof of Theorem 1 that the constant 10 in the threshold is
mainly there due to a number of loose upper bounds; in fact, this constant was 24 in a conference version of this
paper [5] and we believe that it can be reduced even further. Some of the numerical experiments that we have
carried out in this regard also seem to lend credence to our belief. Specifically, Fig. 3 reports the results of one such
experiment concerning partial model-selection performance of the OST algorithm in terms of the metrics of fraction
of detections, $f_D = |\hat{S} \cap S|/|S|$, and fraction of false alarms, $f_{FA} = (|\hat{S}| - |\hat{S} \cap S|)/|\hat{S}|$, averaged over 200 independent trials.
In this experiment, the n x p design matrix X corresponds to an Alltop Gabor frame in C^''^, the noise variance 
is = 10^^, the MAR and the SNR are chosen to be 1 and 3 dB, respectively, and the initial threshold is set at 
$\lambda_s := \max\bigl\{\tfrac{1}{t}\,c'\mu\sqrt{n\cdot\mathrm{SNR}}, \tfrac{1}{1-t}\sqrt{2}\bigr\}\sqrt{2\sigma^2\log p}$ with $t = (\sqrt{2} - 1)/\sqrt{2}$ and $c' = 2t$. It can be easily seen from
Fig. 3 that OST successfully carries out partial model selection ($f_{FA} = 0$) even when the threshold is set at $0.6\lambda_s$,
which illustrates the somewhat conservative nature of the proposed threshold in terms of the constants.
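The two performance metrics used in Figs. 2 and 3 can be computed directly from the true and estimated models. The sketch below assumes the fraction of detections is normalized by $|S|$ and the fraction of false alarms by $|\hat{S}|$, which is how the definitions read here; the function name `detection_metrics` and the example sets are hypothetical.

```python
def detection_metrics(true_model, estimate):
    """Return (f_D, f_FA) for a true model S and an estimate S_hat.
    Assumed normalization: f_D = |S_hat & S| / |S|, f_FA = (|S_hat| - |S_hat & S|) / |S_hat|."""
    true_model, estimate = set(true_model), set(estimate)
    hits = len(true_model & estimate)
    f_d = hits / len(true_model) if true_model else 0.0
    f_fa = (len(estimate) - hits) / len(estimate) if estimate else 0.0
    return f_d, f_fa

# Hypothetical example: two of four true indices found, plus one spurious index.
f_d, f_fa = detection_metrics({1, 4, 7, 9}, {1, 4, 8})
```

Partial model selection in the sense used above corresponds to $f_{FA} = 0$ with $f_D$ possibly below one.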

Finally, the third main contribution of this paper is that we have extended our results on model selection using 
OST to low-complexity recovery of sparse signals. In particular, within the area of low-complexity algorithms for 
sparse-signal recovery (such as matching pursuit [28], subspace pursuit [29], CoSaMP [30], and iterative hard
thresholding [31]), we have for the first time specified polynomial-time verifiable sufficient conditions under which
recovery of sparse signals having generic (random or deterministic) nonzero entries succeeds using generic (random 
Fig. 3. Partial model-selection performance of the OST algorithm (averaged over 200 independent trials) corresponding to an Alltop Gabor
frame in C'"'^. The MAR of the signals in this experiment is 1, the SNR in the measurement system is 3 dB, and the noise variance is ct^ = 10^^. 



or deterministic) design matrices. In addition, we have also provided a bound in the paper on the average coherence 
of generic Gabor frames and used this result to establish that an Alltop Gabor frame in $\mathbb{C}^n$ can be used together
with the OST algorithm to successfully carry out model selection and recovery of sparse signals irrespective of the 
phases of the nonzero entries even if the number of nonzero entries scales almost linearly with n. 

Appendix A 
Concentration Inequalities 

In this appendix, we collect the various concentration inequalities that are used throughout the paper. 

Proposition 4 (The Azuma Inequality [54]). Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $(M_0, M_1, \dots, M_n)$ be a
bounded-difference, (real-valued) martingale sequence on $(\Omega, \mathcal{F}, P)$. That is, $\mathbb{E}[M_i] = M_{i-1}$ and $|M_i - M_{i-1}| \le b_i$
for every $i = 1, \dots, n$. Then for every $\epsilon \ge 0$, we have

$$ \Pr\bigl(|M_n - M_0| \ge \epsilon\bigr) \le 2\exp\Bigl(-\frac{\epsilon^2}{2\sum_{i=1}^{n} b_i^2}\Bigr). \tag{45} $$



Proposition 5 (Inner Product of Independent Gaussian Random Vectors [48]). Let $x, y \in \mathbb{R}^n$ be two random vectors
that are independently drawn from the $\mathcal{N}(0, \sigma^2 I)$ distribution. Then for every $\epsilon \ge 0$, we have



Pry(x,y>|>.j<2cxp^-^-,^-,^J. (46) 

Since we are mainly concerned with complex-valued random variables in this paper, it is helpful to state a 
complex version of the Azuma inequality. The following lemma is an easy consequence of Proposition 4.



July 5, 2010 



DRAFT 






Lemma 5 (The Complex Azuma Inequality). Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $(M_0, M_1, \dots, M_n)$ be
a bounded-difference, complex-valued martingale sequence on $(\Omega, \mathcal{F}, P)$. That is, $\mathbb{E}[M_i] = M_{i-1} \in \mathbb{C}$ and further
$|M_i - M_{i-1}| \le b_i$ for every $i = 1, \dots, n$. Then for every $\epsilon \ge 0$, we have

$$ \Pr\bigl(|M_n - M_0| \ge \epsilon\bigr) \le 4\exp\Bigl(-\frac{\epsilon^2}{4\sum_{i=1}^{n} b_i^2}\Bigr). \tag{47} $$



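Proof: To establish this lemma, first define $S_i = \mathrm{Re}(M_i)$ and $T_i = \mathrm{Im}(M_i)$. Further, notice that since $\mathbb{E}[M_i] =
M_{i-1}$ and $|M_i - M_{i-1}| \le b_i$, we also have that: (i) $\mathbb{E}[S_i] = S_{i-1}$ and $|S_i - S_{i-1}| \le b_i$, and (ii) $\mathbb{E}[T_i] = T_{i-1}$
and $|T_i - T_{i-1}| \le b_i$. Therefore, we have that $(S_0, S_1, \dots, S_n)$ and $(T_0, T_1, \dots, T_n)$ are bounded-difference, real-
valued martingale sequences on $(\Omega, \mathcal{F}, P)$ and hence

$$ \Pr\bigl(|M_n - M_0| \ge \epsilon\bigr) \overset{(a)}{\le} \Pr\Bigl(|S_n - S_0| \ge \frac{\epsilon}{\sqrt{2}}\Bigr) + \Pr\Bigl(|T_n - T_0| \ge \frac{\epsilon}{\sqrt{2}}\Bigr) \overset{(b)}{\le} 4\exp\Bigl(-\frac{\epsilon^2}{4\sum_{i=1}^{n} b_i^2}\Bigr), \tag{48} $$

where (a) follows from a simple union bounding argument and (b) follows from the Azuma inequality.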
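The relation between the real and complex bounds can be checked numerically. The sketch below evaluates both bounds and compares the complex one against a simulated complex-valued martingale whose increments are drawn uniformly from $\{\pm 1, \pm\sqrt{-1}\}$; the function names, the choice of increments, and the simulation parameters are illustrative assumptions.

```python
import math
import random

def azuma_bound(eps, b):
    """Real-valued Azuma bound: 2 exp(-eps^2 / (2 sum b_i^2))."""
    return 2.0 * math.exp(-eps ** 2 / (2.0 * sum(bi ** 2 for bi in b)))

def complex_azuma_bound(eps, b):
    """Complex Azuma bound: 4 exp(-eps^2 / (4 sum b_i^2))."""
    return 4.0 * math.exp(-eps ** 2 / (4.0 * sum(bi ** 2 for bi in b)))

def empirical_tail(n, eps, trials, seed=0):
    """Fraction of simulated walks with |M_n - M_0| >= eps, where M_0 = 0 and each
    increment is uniform on {1, -1, 1j, -1j} (zero mean, modulus b_i = 1)."""
    rng = random.Random(seed)
    steps = (1, -1, 1j, -1j)
    exceed = 0
    for _ in range(trials):
        m = sum(rng.choice(steps) for _ in range(n))
        if abs(m) >= eps:
            exceed += 1
    return exceed / trials

b = [1.0] * 100
bound = complex_azuma_bound(30.0, b)
observed = empirical_tail(n=100, eps=30.0, trials=2000)
```

The complex bound equals twice the real bound evaluated at $\epsilon/\sqrt{2}$, which is exactly the union-bound step (a) followed by step (b) in the proof.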



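Lemma 6 ($\ell_\infty$-Norm of the Projection of a Complex Gaussian Vector). Let $X$ be a real- or complex-valued
$n \times p$ matrix having unit $\ell_2$-norm columns and let $\eta$ be an $n \times 1$ vector having entries independently distributed as
$\mathcal{CN}(0, \sigma^2)$. Then for any $\epsilon > 0$, we have

$$ \Pr\bigl(\|X^H\eta\|_\infty \ge \sigma\epsilon\bigr) < \frac{4p\exp(-\epsilon^2/2)}{\sqrt{2\pi}\,\epsilon}. \tag{49} $$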



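Proof: Assume without loss of generality that $\sigma = 1$, since the general case follows from a simple rescaling
argument. Let $x_1, \dots, x_p \in \mathbb{C}^n$ be the $p$ columns of $X$ and define

$$ z_i = x_i^H\eta, \quad i = 1, \dots, p. \tag{50} $$

Note that the $z_i$'s are identically (but not independently) distributed as $z_i \sim \mathcal{CN}(0, 1)$, which follows from the fact
that the entries of $\eta$ are independently distributed as $\mathcal{CN}(0, 1)$ and the columns of $X$ have unit $\ell_2$-norms. The rest of the proof is pretty elementary and
follows from the facts that

$$ \Pr\bigl(\|X^H\eta\|_\infty \ge \epsilon\bigr) \overset{(a)}{\le} p\cdot\Pr\bigl(|\mathrm{Re}(z_1)|^2 + |\mathrm{Im}(z_1)|^2 \ge \epsilon^2\bigr) \overset{(b)}{\le} 2p\cdot\Pr\Bigl(|\mathrm{Re}(z_1)| \ge \frac{\epsilon}{\sqrt{2}}\Bigr) = 2p\cdot 2Q(\epsilon) \overset{(c)}{<} \frac{4p\exp(-\epsilon^2/2)}{\sqrt{2\pi}\,\epsilon}. $$

Here, (a) follows by taking a union bound over the event $\bigcup_i\{|z_i| \ge \epsilon\}$, (b) follows from taking a union bound
over the event $\{|\mathrm{Re}(z_1)| \ge \epsilon/\sqrt{2}\} \cup \{|\mathrm{Im}(z_1)| \ge \epsilon/\sqrt{2}\}$ and noting that the real and imaginary parts of the $z_i$'s are
identically distributed as $\mathcal{N}(0, \tfrac{1}{2})$, and (c) follows by upper bounding the complementary cumulative distribution
function as $Q(\epsilon) < \frac{1}{\sqrt{2\pi}\,\epsilon}\exp(-\tfrac{1}{2}\epsilon^2)$ [55]. ■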
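The Gaussian tail bound in step (c), and the way Lemma 6 is applied in the proof of Theorem 1 with $\epsilon = 2\sqrt{\log p}$, can be verified numerically. The following sketch uses only the standard library; the function names are hypothetical.

```python
import math

def q_function(x):
    """Complementary CDF of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def q_upper_bound(x):
    """Upper bound exp(-x^2 / 2) / (sqrt(2 pi) x), valid for x > 0."""
    return math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * x)

def lemma6_bound(p, eps):
    """Right-hand side of Lemma 6 for sigma = 1: 4 p exp(-eps^2 / 2) / (sqrt(2 pi) eps)."""
    return 4.0 * p * math.exp(-0.5 * eps * eps) / (math.sqrt(2.0 * math.pi) * eps)

p = 128
eps = 2.0 * math.sqrt(math.log(p))
noise_tail = lemma6_bound(p, eps)   # matches 2 / (p sqrt(2 pi log p))
```

With this choice of $\epsilon$ the bound collapses to $2(\sqrt{2\pi\log p}\cdot p)^{-1}$, which is the probability term that appears in the definition of the event used in the proof of Theorem 1.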